The Video Document
                                 Bob Davies, Rainer Lienhart, and Boon-Lock Yeo
                                                  Intel Corporation
                                            Microcomputer Research Lab
                                            2200 Mission College Blvd.
                                           Santa Clara, CA 95052 - 8119
                              {Bob.Davies, Rainer.Lienhart, Boon-Lock.Yeo}

The metaphor of film and TV permeates the design of software to support video on the PC. Simply transplanting the non-
interactive, sequential experience of film to the PC fails to exploit the virtues of the new context. Video on the PC should be
interactive and non-sequential. This paper experiments with a variety of tools for using video on the PC that exploit the new
context of the PC. Some features are more successful than others. Applications that use these tools are explored, including
primarily the home video archive but also streaming video servers on the Internet. The ability to browse, edit, abstract and
index large volumes of video content such as home video and corporate video is a problem without an appropriate solution in
today’s market. The tools currently available are complex, unfriendly (professional) video editors that require hours of work to
prepare a short home video, far more work than a typical home user can be expected to provide.

Our proposed solution treats video like a text document, providing functionality similar to a text editor. Users can browse,
interact, edit and compose one or more video sequences with the same ease and convenience as handling text documents.
With this level of text-like composition, we call what is normally a sequential medium a “video document”. An important
component of the proposed solution is shot detection, the ability to detect when a shot started or stopped. When combined
with a spreadsheet of key frames, the shots become a grid of pictures that can be manipulated and viewed in the same way
that a spreadsheet can be edited. Multiple video documents may be viewed, joined, manipulated, and seamlessly played
back. Abstracts of unedited video content can be produced automatically to create novel video content for export to other
venues. Edited and raw video content can be published to the net or burned to a CD-ROM with a self-installing viewer for
Windows 98 and Windows NT 4.0.

Keywords: Video grid, shot grid, video editor, MPEG-native editor, home video archive, shot detection, video abstraction,
video/audio content analysis.

                                                  1. INTRODUCTION
There are times when the past encumbers what we do and the way we think. In the evolution of the automobile, the early car
was simply a carriage with a motor. We are at a similar junction with video on the PC. The metaphor of film permeates the
design of video software for the PC. Software products to manage and edit video on the PC are making us ride in a carriage
when better designs should be made available to the consumer.

Video on the PC is not film. Film is a non-interactive, sequential, shared experience while video on the PC is an interactive,
randomly accessible, and typically unshared experience. Transplanting the experience of film onto the PC without
significant change in design will not leverage the advantages of the PC.

The PC is a tool that can provide a different experience. PCs are good at providing random access, whether searching for
medical information on the web or finding a recent email. The PC is meant to be interactive and responsive to requests from what
is typically one user. Using the PC for video without interactivity would only make it an expensive VCR.

Video on the PC is a relatively new phenomenon, especially at the consumer level. The capture devices are only now
becoming widely available and we can start experimenting. There are 2 questions these experiments can help answer:

              1. What is the information to be organized?
              2. Can the content benefit from an interactive presentation?
In the rest of the introduction we focus on the first question; the second is addressed in the rest of the
paper. So what kinds of video naturally qualify as a playground for such experiments?

Movies and TV shows are probably not well suited to the PC because the content is strongly oriented towards a stream
of shared experience. Jumping around inside “Saving Private Ryan” would probably not enhance the viewing experience.
The PC simply does not have much to do other than playing the video. There might be exceptions such as skipping
commercials or replaying sporting events. Jumping around the tape of a baseball game to find the action, repeating the
important plays, or single stepping through the controversial plays, might significantly enhance the viewing experience.

A second choice might be educational materials. Imagine a web site with video showing origami (paper folding) for a variety
of objects. You would select the swan or balloon that you would enjoy learning how to create and watch just that section.
You would be able to refer to it again as you struggle with the different folds. However, there is not a lot of such material
available. In addition, the integration of the content with the presentation is an important design element that might yield
unique solutions useful only to, say, origami.

Both film and educational materials suffer from an additional conflict: ownership. Finding a novel presentation of
copyrighted material could be enormously frustrating. A technical problem solved without a legal context is an invitation to

What content is available for the PC? Since about the mid-1980s, consumer video cameras have been widely available.
Parents have bookshelves full of videotapes from nearly every one of their child’s events, from birthdays to sporting events.
Unfortunately, the content of this bookshelf often goes unused. A tape labeled with the location of a recent vacation might
also have the winning goal from a soccer game. A skiing trip can be on the end of one tape and on the beginning of another.
Worse, tapes are almost never sorted and organized, so even finding a tape from the right time period is difficult.
Also, once the tape is found, finding the right place on the tape is troublesome, especially since there is no certainty about the

Home video may not be the most exciting video content but it is a good place to experiment with what the PC can do. A
solution to the problem of organizing home videos will provide a good metaphor for other categories such as organizing news
footage for a local TV station or providing video reference material in a library.

Our approach treats video like a text document, providing functionality similar to a text editor. The rest of this paper presents
in detail our design of “video document” and several experiments we conducted to explore the variety of things that a PC can
do with video.

                                                          2. TOOLS
Before discussing the experiments with a home video archive or any other application, a short summary of the enabling tools
will help orient developers who follow. The target platforms are Windows 98™ and Windows NT™, primarily to allow the
software to be widely shared with colleagues. The programming tools were Microsoft Visual Basic™ for the user interface
and a C++ compiler for the component libraries. Visual Basic was chosen because it makes it simple to try a
design, throw it out and try again. The studied minimalism of code written in VB makes this possible.

Of all the components used, none is more important than the MPEG Processing Library (MPL), a general-purpose and optimized
MPEG processing software infrastructure developed in our lab. The library provides random access to any frame and was
considerably faster than all commercially available libraries. The MPL component also allowed the creation of necessary
methods, properties, and events to make the experimentation easier.

Shot detection (discussed later) is a key component in the process of creating a visual overview of video on the PC. Shot
detection is often the first step after capturing video because it provides the skeleton for subsequent views.

Common to almost all presentations is the use of a grid control that allows cells in the grid to contain pictures. VideoSoft’s
VSFlex™ control is one such control that is commercially available. Other controls are available which can provide a similar

Databases are not more efficient than simple flat files. When used to manage video, the database need only contain a few
numbers such as the starting frame number, ending frame number, and the representative key frame. However, by using
database technology, the meaning of the individual fields is transparent and self-documenting. Anyone can understand your
definitions using a variety of database programs. More important, adding new fields for different experiments such as
allowing the user to associate a description with any shot is easily handled while maintaining backward compatibility. The
specific database tool employed is Microsoft’s Data Access Objects™ (DAO). Microsoft’s Access™ is the tool used to verify
the contents of the database as well as expose field names and definitions.
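
The shot table described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than the DAO and Access tools named in the text, and the table name, column names, and file name are hypothetical. It shows how a description field might be added later without disturbing code that reads only the original fields.

```python
import sqlite3


def create_shot_table(conn):
    # Hypothetical schema: only the few numbers the text mentions per shot.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shots ("
        " shot_id INTEGER PRIMARY KEY,"
        " video_file TEXT NOT NULL,"
        " start_frame INTEGER NOT NULL,"
        " end_frame INTEGER NOT NULL,"
        " key_frame INTEGER NOT NULL)"
    )


def add_description_field(conn):
    # Add a new experimental field only if it is not already present.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(shots)")]
    if "description" not in columns:
        conn.execute("ALTER TABLE shots ADD COLUMN description TEXT DEFAULT ''")


conn = sqlite3.connect(":memory:")
create_shot_table(conn)
conn.execute(
    "INSERT INTO shots (video_file, start_frame, end_frame, key_frame)"
    " VALUES (?, ?, ?, ?)",
    ("birthday.mpg", 0, 299, 150),  # hypothetical file name and frame numbers
)
add_description_field(conn)

# A query written before the new field existed still works unchanged.
old_style_rows = conn.execute(
    "SELECT start_frame, end_frame, key_frame FROM shots"
).fetchall()
new_field = conn.execute("SELECT description FROM shots").fetchone()
```

Because the new column has a default value, existing rows and existing queries remain valid, which is the backward compatibility the paragraph refers to.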

Installation and setup are important steps in the proliferation of any software tool. Developing the setup script while
developing the software is a simple empirical method to ensure that software gets into the hands of real users. Great products
have been trapped inside poorly managed setup scripts. A number of commercially available packages were evaluated and
all suffered from the same deficiencies – unique scripting languages, quirky user interfaces, and a failure to make the
installation process transparent. Writing the setup script in a widely available language does more than make the syntax
familiar. It makes the process transparent and the complexities of installation in a Microsoft environment approachable. A
simple bootstrap program ensures that all necessary components of the Visual Basic environment are present. The VB setup
program takes control, allowing a flexible user interface with the familiar wizard metaphor. Serious effort was put into
making the install process as fast as possible while avoiding a reboot as often as possible.

                                      3. VIDEO CAPTURE AND PLAYBACK
The purpose of this paper is to focus on what happens after the video has been captured. However, capture has to be done to
make the content available as an MPEG-1/2 file on the PC. Once a video is captured it is desirable to always preserve the
original. In our scheme, all changes to the content like insertion, deletion, and transitions do not alter the original, only the
way it is played. Alternative views or abstractions may be created in a future examination of the videos but the originals will
remain intact. Some portion of the video that is not interesting now could prove invaluable later. Video is a lot for most
PCs to digest at this point, so avoiding copies of even portions of a video is desirable now and may remain desirable even when
it is no longer strictly necessary.

Critical to this efficiency of keeping the original copy is seamless playback of edited video. Without seamless playback,
viewer/editor disparities can emerge. If the playback of a video while editing were different from the playback when simply
viewing the video, it would no longer be a WYSIWYG solution. MPL provides real-time access to individual I, P, and B
frames by means of the creation of an index file.

                                                4. VIDEO DOCUMENT
1. Shot View
The first step in building a video document on the PC is shot detection. A shot is a video sequence recorded by a camera’s
uninterrupted operation. For instance, if the video contains a baby walking, then breaks, and then shows the baby sitting in a
high chair eating, that break would define a shot boundary. The camera was stopped, pointed at something new, and started

In principle, shot boundaries can be detected automatically in two ways: Either there is some metadata attached with the
video that allows finding shot boundaries or the shot boundaries must be deduced by analyzing the visual stream. In the case
of digital video (DV) the cameras encode the date and time of recording with the video stream, which makes it possible for a
computer to find any break in the time line easily. However, on conventional analog video cameras, this information either
does not exist or is not externally available. If the video is MPEG encoded by an external device, any time information even
in DV is gone. Therefore, shot detection is performed on the visual stream.

Shot detection is a widely researched area [1-6]. We use the hard cut detection algorithm proposed in [5] and the fade
detector in [4] to determine shots. Since both detectors work on only the DC coefficients of the MPEG video, our shot
detector runs 19 times faster than real-time on a Pentium-II 450 MHz.

Once the shots have been detected, a key frame from the middle of the shot is used as a representative of the shot. Preparing
a grid full of these key frames is a natural way to provide random access to any shot within a video and is called the Shot
View (see Figure 1). Some natural ways of using a grid fall out from showing the Shot View. Clicking on a frame will
open it in a full-size video player. Double-clicking any key frame will play the shot starting at that location.

Because the grid contains many small images, it is useful to put several players at the bottom of the screen. Clicking an
image readies one of the players at that frame and enlarges the frame to full size. This allows closer inspection before
playing. Additionally, similar images can be compared when both are enlarged in separate players.
The size of the key frames in the grid is variable. Some videos will benefit from smaller images so that more can appear on
the screen. Making the images as small as possible allows more on the screen and easier scanning and scrolling. If more
detail is desired the key frames can be enlarged.

One of the key features of the grid format is that it does not matter whether you are viewing the video or editing it. The grid
defines a mechanism for random access that is useful in both contexts. For purposes of product definition, it may be useful to
have both an editor and a viewer, much like the Adobe Acrobat model. For Acrobat, the reader is freely distributed and the
editor to create Acrobat documents is sold to a narrow market. What is shown in Figure 1 is the reader while Figure 2
shows the editor.

Figure 1: This shows the basic grid of key frames, highlighting the selected frame. In addition, the screen width
permitted 2 player windows at the bottom. Click on an image to display it full-size. Double-click to playback the

2. Shot Detection Problems
One of the first problems encountered when writing a shot detection component for home video is flash photography.
Flashes are very common in home video and dramatically change the color of the entire contents of the shot. This, in
general, will result in a shot boundary for most detection algorithms unless the software either restricts shot boundary
detection to hard cuts and fades or explicitly searches for shot boundaries caused by flashes. In the first case, flashes are just
not detected since they usually cause a double spike in the sequence of color histogram differences of video frames, thus
violating the definition of hard cuts. On the other hand, a double spike is too short to be a fade. It is reasonable to assume that
raw video footage does not contain other types of transitions. In the second case, it could be checked whether

• the sequence of color histogram differences across a shot boundary exhibits a double spike,
• a participating frame possesses a significantly higher average brightness than its two delineating frames, and
• the two delineating frames’ color contents are very similar.
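
The three checks above can be sketched as follows. This is a minimal illustration, not the detector used in our software: frames are assumed to be flat lists of 8-bit gray values (for example, a small DC image), a coarse gray histogram stands in for the color histogram, and all function names and thresholds are hypothetical.

```python
def histogram(frame, bins=8, max_value=256):
    # Coarse histogram of pixel values; frame is a flat list of 0..255 values.
    counts = [0] * bins
    for value in frame:
        counts[value * bins // max_value] += 1
    return counts


def histogram_difference(a, b):
    # Normalized histogram difference in [0, 1]; frames must be the same size.
    ha, hb = histogram(a), histogram(b)
    return sum(abs(x - y) for x, y in zip(ha, hb)) / (2 * len(a))


def mean_brightness(frame):
    return sum(frame) / len(frame)


def looks_like_flash(before, middle, after,
                     spike=0.5, similar=0.1, brightness_gain=1.5):
    # Check 1: a double spike, i.e. large differences into and out of the frame.
    double_spike = (histogram_difference(before, middle) > spike
                    and histogram_difference(middle, after) > spike)
    # Check 2: the middle frame is much brighter than both neighbours.
    brighter = mean_brightness(middle) > brightness_gain * max(
        mean_brightness(before), mean_brightness(after))
    # Check 3: the two delineating frames have very similar content.
    neighbours_similar = histogram_difference(before, after) < similar
    return double_spike and brighter and neighbours_similar
```

A candidate boundary that passes all three checks would be discarded as a flash rather than reported as a shot boundary; a true hard cut fails the first and third checks because the frame after the cut resembles the new shot, not the old one.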

Another problem with shot detection is the selection of the key frame. For our purpose it is tempting to have the first
frame of any shot be the key frame to represent the shot. However, the first frame or even the first few frames are often
transitions or fades or mangled by a poor quality video camera. A compromise was to display the middle frame as the
representative frame. The interface allows the user to override the default to make any memorable frame the representative
frame. Alternative algorithms to automatically select a key frame should be explored. One algorithm might be to take not the
first frame but a frame some small fraction of the way from the beginning. A more expensive solution might be to average the
color of all frames in a shot and find the frame closest to the average color.
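
As a rough sketch of the two strategies just mentioned, the default middle frame and the more expensive closest-to-average frame could be computed as shown here. The representation is an assumption made for illustration: each frame is a list of (r, g, b) tuples, and the function names are hypothetical.

```python
def middle_key_frame(start_frame, end_frame):
    # Default choice: the frame in the middle of the shot.
    return (start_frame + end_frame) // 2


def mean_color(frame):
    # Average (r, g, b) over all pixels of one frame.
    n = len(frame)
    return tuple(sum(pixel[c] for pixel in frame) / n for c in range(3))


def closest_to_average_key_frame(frames, start_frame):
    # frames holds the decoded frames of one shot, in order.
    colors = [mean_color(f) for f in frames]
    shot_average = tuple(
        sum(color[c] for color in colors) / len(colors) for c in range(3))

    def squared_distance(color):
        return sum((a - b) ** 2 for a, b in zip(color, shot_average))

    best = min(range(len(frames)), key=lambda i: squared_distance(colors[i]))
    return start_frame + best
```

The second function has to decode every frame of the shot, which is why it is the more expensive option.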

Figure 2: A Time View showing frames at 1-second intervals. Rather than sequentially searching a video for the
moment when the baby was whistling, the Time View allows you to find the frame at a glance. Also shown in this
version is the Asset Manager (left side) to manage the list of videos on the machine.
3. Searching the Video
The conventional metaphor for playing video is a series of play, pause, and stop buttons seen on most consumer audio
devices. Beyond that conventional metaphor, Figure 1 shows some buttons that allow the user to jump to the previous shot,
previous frame, next frame and next shot. The ability to play the video in a fast backward and fast forward fashion allows
the user to search a sequence.

Searching backward and forward is a common means of looking at videotape and is present on many consumer devices.
Using fast forward to find a specific frame on the PC is quite common. However, searching backward is not present in any
commercially available software player or editor. Our fast backward capability, which is enabled by MPL’s support for
backward frame access in MPEG video, fills in a hole left in most mechanisms to search video on the PC.

4. Time View
A Time View (Figure 2) is a series of frames selected at a specific time interval, e.g. one frame for every two seconds. The
Time View can provide a simpler way to search than conventional playback because many frames can be viewed at once.
The mosaic of frames has all the same ability to click, double-click, or fast forward. More precision can be obtained by
selecting a shot or a series of shots and then opening a Zoom View that is simply a Time View for the selected time. A Zoom
View can be thought of as a Time View with finer granularity.
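
The relationship between the Time View and the Zoom View can be illustrated with a short sketch. It assumes a constant frame rate and uses a hypothetical function name; it only computes which frame numbers would appear in the grid.

```python
def time_view_frames(start_frame, end_frame, fps, interval_seconds):
    # One frame every interval_seconds between start_frame and end_frame.
    step = max(1, round(fps * interval_seconds))
    return list(range(start_frame, end_frame + 1, step))


# Time View: the first five seconds of a 30 fps video at 2-second intervals.
overview = time_view_frames(0, 149, fps=30, interval_seconds=2)

# Zoom View: the same function on a selected range with a finer interval.
zoom = time_view_frames(60, 119, fps=30, interval_seconds=0.5)
```

The Zoom View needs no separate mechanism; it is the same sampling applied to a shorter range with a smaller interval.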

5. Asset Management
Video files are large and keeping track of them is important as they represent a high percentage of the disk usage. Asset
management is an attempt to collect a meaningful view of all the video available to the system. There are two parts to asset
management, the Asset Manager and the Asset Summary (see Figure 3). The Asset Manager is always visible on the left
side of the screen. Each asset in the view shows the principal name of the video. This tree view is used to select which video
to work on or which view of the video is to be opened.

The Asset Summary is a grid containing all video content on the PC. It can be sorted in a variety of ways by clicking on the
title of the respective column. The Asset Summary can be used to find a video from a particular date or with a specific
description or category.

It is difficult to predict what variables will be important to keep in the database but month and year are obvious candidates,
although they are often found in the title as well. Putting the month and year in the grid allows the videos to be sorted by
date and this can be helpful. Description is potentially useful and should be there. Category is an attempt to think ahead to
the time when there are so many videos that there will be a need for further categories.

Any field in the grid can be edited just like a spreadsheet and a number of them have pulldown menus to make the edits

Figure 3: The Asset Manager along the left side is used to select videos to work on or to open a view of a particular
video. The Asset Summary in the work area of the form is an editable grid displaying all the information in the
database about each video.
6. Edit, Cut, Copy, and Paste
Accessing any frame in a video can be more useful if that frame can be pulled out of the video and copied elsewhere. The
image can be placed in an email or posted to the web. Frame accurate control is possible using the frame selection controls
(the plus (+) and minus (-) signs in the toolbar). The image is copied to the clipboard in either JPEG or Windows Bitmap

The Copy function was the first and simplest to implement but the comparison with a word processor only began there. The
Copy function simply moves an image and its associated temporal extent into the system clipboard for a later paste.
Selecting a frame and then specifying the Cut function removes that shot or frame sequence from the view. If more than one
frame is selected, then more than one is cut from the document.

Selecting more than one key frame in the grid and using the Copy function makes that series of frames available to a Paste
function. That series of shots can then be placed into any location in the video document. Anyone familiar with a word
processor or a spreadsheet will have no trouble with the concepts. Figure 4 shows the three steps of highlighting a sequence
of 5 shots from a video sequence, cutting out the segment and finally pasting the five shots at the end of the sequence. This
sequence of highlight-cut-paste on shots mirrors the analogous operations on text. Figure 5 illustrates the method
employed by conventional video editing software: a timeline is used for the creation of a new video sequence. Segments of video are
selected and dropped onto the timeline. Our video document approach allows editing to be made on the existing view of the
video in the same fashion as text editing is performed, whereas the present video-editing paradigm requires the creation of a
new sequence based on segments of existing ones.

The edit functions also work when using more than one video document. For instance, a series of frames in one document
may be copied to the clipboard and then pasted onto another video document. When the combined video document is played,
the video player will seamlessly switch between videos during playback. To anyone watching the video, it will appear that
the two videos are part of a single MPEG file.

When the combined video document is saved, it is not saved as a single MPEG file. Instead, the database for the video asset
is updated to reflect the new series of shots with their respective video file names. The typical table in the database for such a
video document is small, usually less than 10k bytes.
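
One way to picture such a video document is as an ordered list of shot references, with Cut and Paste operating on the list while the MPEG files stay untouched. The sketch below is an illustration under that assumption; the Shot class, the function names, and the file names are hypothetical and do not describe our database layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    # A reference into an original, unmodified video file.
    video_file: str
    start_frame: int
    end_frame: int


def cut(document, first, last):
    # Remove shots first..last (inclusive); return new document and clipboard.
    clipboard = document[first:last + 1]
    remaining = document[:first] + document[last + 1:]
    return remaining, clipboard


def paste(document, clipboard, position):
    # Insert the clipboard shots before the given position.
    return document[:position] + clipboard + document[position:]


doc_a = [Shot("a.mpg", 0, 99), Shot("a.mpg", 100, 199), Shot("a.mpg", 200, 299)]
doc_b = [Shot("b.mpg", 0, 49)]

remaining_a, clipboard = cut(doc_a, 0, 1)
combined = paste(doc_b, clipboard, len(doc_b))
```

Saving the combined document then amounts to storing this short list, which is consistent with the small table size noted above; playback walks the list and switches files at shot boundaries.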

Figure 4: The steps of highlighting a sequence of shots, cutting the highlighted segments and pasting the segments at
the end of the sequences. Editing is done directly on the current sequence.
Figure 5: The current video editing approach: creating a new video sequence on a timeline view. Editing is
performed by the creation of a new sequence.
7. MPEG Native Editor
Playing an edited video with the MPL library always uses the original video content for playback and manipulates it on the
fly (it is fast enough to produce a seamless video stream). However, there are a variety of hardware decoders, software decoders,
and other DirectShow applications that the consumer might want to use. Our video editor needs to be able to export MPEG
files to these other applications.

The MPEG native editor is provided for that purpose. The MPEG native editor is built with a precise knowledge of the
MPEG standard and maintains the bit rate of the original and the series of timecodes in the MPEG stream. There are plenty
of video editors available that handle MPEG editing; however, nearly all of them require decompression and recompression
of the whole edited video. In contrast, the MPEG native editor just copies those GOPs that are untouched by the editing.
Only new shot boundaries need decompression and recompression of the affected GOPs. This reduces the time of
computation significantly.
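
The copy-versus-recompress decision can be sketched as a simple classification of GOPs against the frame range that an edit keeps. This is a deliberately simplified illustration, not the editor's actual logic: it assumes closed GOPs identified by their first frame number and ignores frame reordering and bit-rate control. All names are hypothetical.

```python
def plan_gops(gop_starts, total_frames, keep_start, keep_end):
    # gop_starts: first frame number of each GOP, in ascending order.
    # Returns (first_frame, last_frame, action) for each GOP touched by the
    # kept range keep_start..keep_end (inclusive).
    plan = []
    for i, gop_start in enumerate(gop_starts):
        next_start = gop_starts[i + 1] if i + 1 < len(gop_starts) else total_frames
        gop_end = next_start - 1
        if gop_end < keep_start or gop_start > keep_end:
            continue  # GOP lies entirely outside the kept range
        if keep_start <= gop_start and gop_end <= keep_end:
            plan.append((gop_start, gop_end, "copy"))
        else:
            # A shot boundary falls inside this GOP, so only part of it is kept.
            plan.append((max(gop_start, keep_start),
                         min(gop_end, keep_end), "re-encode"))
    return plan
```

For a kept range that spans many GOPs, only the first and last entries of the plan need recompression, which is where the time saving comes from.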

8. Anti-Jitter Playback
Working with a variety of home videos means encountering some pretty poor hand-held camera work. Rapid pans, rapid
zooms and an inability to hold the camera steady make viewing the video difficult if not nauseating. Many of the modern
cameras have image stabilization features that reduce jitter considerably. However, rapid pans and zooms cannot be
eliminated even with image stabilization. In addition, many videos have already been created without image stabilization.
Therefore, an anti-jitter playback feature has been added to the software specifically to handle these problems.

Anti-jitter playback works by giving the user control of the time between frames. One way to interpret it is to say that it
converts the video to a slide show with the original audio playing. Allowing the user to focus on an image that is on the
screen for one or more seconds reduces the nausea. If the jitter is low, the images may be updated more frequently. If it is
high, the image may be held for several seconds.
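
The frame-holding idea can be sketched as a mapping from the frame that normal playback would show to the frame that is actually displayed. The function name is hypothetical and a constant frame rate is assumed; audio is taken to play continuously and is not modeled.

```python
def displayed_frame(frame_index, fps, hold_seconds):
    # Hold each displayed image for hold_seconds instead of updating it
    # at every frame.
    hold_frames = max(1, round(hold_seconds * fps))
    return (frame_index // hold_frames) * hold_frames


# Ten seconds of 30 fps video with the image held for two seconds at a time.
shown = sorted({displayed_frame(i, fps=30, hold_seconds=2.0) for i in range(300)})
```

Only five distinct images are decoded and shown for these 300 frames, which illustrates why this mode is cheaper than normal playback.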

The resources required for anti-jitter playback are less than for normal playback because far fewer frames are actually displayed.
With less work, otherwise unviewable video can be seen and heard. For the future, we also plan to fight jitter in the narrow sense (i.e.,
excluding rapid pans and zooms) by cropping multiple frames in time so as to keep the “image” stable.

9. CD-ROM Image
Documents are meant to be shared. The same is true for video documents. Whatever you have created - the different edited
versions of a video, its video abstracts and meta-information - you should be able to pass it on to other people, so that you can
share what you have created. Therefore, we provide the capability to burn CD-ROMs that include the MPEG file, the
different views, the meta-information as well as self-installing playback software. This playback software is strictly a viewer
that will present a grid of video (based on the results of the shot detection) and allow random access. It does not include the
edit, cut, and paste features or the Asset Manager. Once you have inserted the video CD-ROM into your computer it will start
the viewer application automatically.
One positive side effect of creating CD-ROM images of video documents is that it provides an easy and cheap backup
solution for the large MPEG files.

10. Date Capture
It would be nice if the date of every video were captured along with the video. Digital video recorders have the date encoded
within the video stream but once the video is captured (usually in an external capture device), there is no provision to carry
the date in the MPEG format. The simple solution is to force the user to provide month and year for every video that is added
to the video asset.

There is another solution but it requires a great deal more processing to obtain (about 10 minutes for a 1-hour MPEG-1 video
on a Pentium II 450 MHz). Most home video cameras have a provision to place a month, day, and year on the video as it is
being captured. This date can be obtained automatically by some unconventional Optical Character Recognition (OCR)
techniques [7-12].

Conventional OCR would take a single frame and try to isolate the characters by finding the edges, performing some
horizontal and vertical transformation to create projections that are looked up in a table. However, OCR on a video frame is
not the simple black and white problem that appears when working on the printed page. The letters are typically white or
light gray but the area behind the letters may be snow, making the letters virtually invisible.

The technique used here is a simplified version of the one described in [11]. It takes advantage of the fact
that the date’s location is fixed over time and that the date’s color is very bright. For every shot the frames are stacked on top of
each other and the minimum pixel intensity is calculated over time for each pixel position. As a result, only date text
pixels keep their brightness and can therefore be extracted by simply thresholding the combined image and removing
small regions not meeting the geometric requirements of characters. In order to improve the quality of the extracted date
character bitmaps, all images are scaled up by a factor of four using cubic interpolation before applying any operations.
Next, we utilize the fact that the date text has a very simple structure such as “MM DD YYYY” or “HH:MM:SS” in order to
correct falsely recognized characters, fill in missed characters and judge whether the recognition is correct or needs manual
revision. On 5.5 hours (676 shots) of home video from analog tapes, 96% of all dates were identified and judged correct,
3.1% were marked as needing manual resolution, and 0.9% were recognized incorrectly.
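
The minimum-over-time step can be illustrated with a small sketch. It covers only the stacking and thresholding, not the upscaling, region filtering, or character recognition, and it assumes each frame is a list of rows of 8-bit gray values. The function names and the threshold value are hypothetical.

```python
def minimum_over_time(frames):
    # Per-pixel minimum intensity across all frames of one shot.
    height, width = len(frames[0]), len(frames[0][0])
    return [[min(frame[y][x] for frame in frames) for x in range(width)]
            for y in range(height)]


def threshold(image, level=200):
    # Binary mask: 1 where the pixel stayed bright in every frame.
    return [[1 if value >= level else 0 for value in row] for row in image]


# Three tiny 2x3 frames: the pixel at row 0, column 1 belongs to the date
# overlay and stays bright; other pixels are bright only occasionally (snow).
frames = [
    [[255, 250, 30], [40, 90, 255]],
    [[20, 252, 240], [35, 255, 60]],
    [[200, 251, 10], [250, 20, 70]],
]
date_mask = threshold(minimum_over_time(frames))
```

Background pixels that are bright in some frames but not in all of them drop out of the minimum image, so snow behind the date no longer hides the characters.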

                                                 5. VIDEO ABSTRACTS
Recording home videos with camcorders is much more popular than playing them back. This difference is due to the fact
that unedited home video footage is usually long-winded, lacks visually appealing effects, and thus tends to be too
time-consuming and boring to watch. However, most people do not have time to edit their videos, and even if they did,
the resulting video would be too inflexible to adjust to the viewers’ needs. For instance, if some friends visit, it is very likely
that you only want to show them a 15-minute excerpt of your last vacation in order not to bore your guests, while when your
parents visit, a longer excerpt could be tolerable. This cannot be done easily with current systems. However, a system
capable of abstracting raw video into shorter video automatically in real-time could give a user that flexibility. It would
easily generate video abstracts satisfying the individual time-constraints.

Other papers describe how to create a video abstract from a home video [13]. What matters here is that
a short abstract can be created within a few seconds for any number of videos. There are two types of video abstracts created
with our software. Both methods require that each shot be cut down in a preprocessing step to a short clip of ten seconds at
most showing only the most important part. Importance is here defined as the part of the shot that has the largest audio
support [13]. Once the shots have been shortened two different kinds of abstracts can be created.

The first abstract is a simple Random Abstract that chooses clips randomly until the target abstract length is reached. Each
time a new abstract is created it will be different. The length of the abstract is specified along with the list of videos to
abstract. A simple request might select five minutes from three videos. The results are presented in a grid which can be
played or replayed any number of times.
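
A Random Abstract of this kind can be sketched as follows. The clip representation, a (video file, start second, duration in seconds) tuple, and the function name are assumptions for illustration; presenting the chosen clips in chronological order is also an assumption, since the text does not specify the playback order.

```python
import random


def random_abstract(clips, target_seconds, rng=None):
    # clips: (video_file, start_second, duration_seconds) tuples, already
    # shortened to at most ten seconds each in the preprocessing step.
    rng = rng or random.Random()
    pool = list(clips)
    rng.shuffle(pool)
    chosen, total = [], 0.0
    for clip in pool:
        if total >= target_seconds:
            break
        chosen.append(clip)
        total += clip[2]
    # Assumption: play the selected clips in their original order.
    chosen.sort(key=lambda clip: (clip[0], clip[1]))
    return chosen
```

Called without an explicit random generator, the function returns a different selection on each run, matching the behavior described above; passing a seeded generator makes a particular abstract reproducible.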

The second type of abstract is called Smart Abstract. The difference is that the date of recording (see Section 4) is used to
cluster shots into a hierarchy of shot clusters of weeks and days. This hierarchy of shot clusters is then used to create
significantly better abstracts compared to the random abstracts. A Smart Abstract is driven by rules that implement the
following objectives in order to create good-quality abstracts:
    • Balanced Coverage. The video abstract should be composed of clips from all parts of the source video set.
    • Shortened Shots. Commonly, the duration of the raw, unedited shots is too long and the content too long-winded for
      video abstracts. Moreover, their uncut presence in video abstracts does not offer a balanced coverage of the source
      video material. Therefore, shots exceeding a maximum length must be cut down to their most interesting parts.
    • Random Selection. Due to the nature of home video material, all shots are generally more or less equally important.
      In addition, individual abstracts of the same source videos should vary each time in order to hold interest after
      multiple playbacks. Therefore, “controlled” random clip selection should be a key feature of our video abstracting
    • Focused Selection. If the abstraction ratio is high, commonly the case, the abstracting algorithm should focus only
      on a random subset of week clusters and the corresponding day clusters. Thus, a more detailed coverage on selected
      clusters is preferred over a totally balanced, but more superficial coverage.
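
The interplay of these objectives can be sketched in Python. The sketch below is a simplified illustration under stated
assumptions and not the rule set of our software: clips are assumed to be already shortened, weeks are taken to be ISO
calendar weeks, and the names cluster_by_week_and_day, smart_abstract, and max_weeks are hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime
from typing import List, Optional, Tuple

Clip = Tuple[datetime, float]  # (recording time, clip length in seconds)


def cluster_by_week_and_day(clips: List[Clip]):
    """Group clips into {(iso_year, iso_week): {date: [clips]}}."""
    weeks = defaultdict(lambda: defaultdict(list))
    for clip in clips:
        year, week, _ = clip[0].isocalendar()
        weeks[(year, week)][clip[0].date()].append(clip)
    return weeks


def smart_abstract(clips: List[Clip], target_seconds: float, max_weeks: int,
                   rng: Optional[random.Random] = None) -> List[Clip]:
    rng = rng or random.Random()
    weeks = cluster_by_week_and_day(clips)
    # Focused Selection: keep only a random subset of the week clusters.
    week_keys = rng.sample(sorted(weeks), min(max_weeks, len(weeks)))
    days = [list(day_clips) for key in week_keys for day_clips in weeks[key].values()]
    for day in days:
        rng.shuffle(day)  # Random Selection within each day cluster
    chosen: List[Clip] = []
    total = 0.0
    # Balanced Coverage: visit the day clusters round-robin, one clip per pass.
    while any(days):
        for day in days:
            if day:
                clip = day.pop()
                if total + clip[1] <= target_seconds:
                    chosen.append(clip)
                    total += clip[1]
    return sorted(chosen)  # chronological order
```

With a small max_weeks and a short target, the abstract covers a few weeks in some detail instead of touching every
week superficially, which is the trade-off described under Focused Selection.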

Currently, there is no attempt to provide visual cues for the transitions between shots in the abstract, but such cues would be
highly desirable. Showing a simple transition such as a door opening or a horizontal wipe gives the viewer a sense that there
is a break in time.

It may be argued that creating a video abstract is obsolete when scanning video is so easy using a shot view. Jumping
from frame to frame in a grid and playing small portions is the reason for the grid view. However, when the goal is a short
video to share with others who are passively watching, the abstract remains useful.

                                            6. MARKET COMPARISONS
The market for video capture devices is growing rapidly as more consumers discover utility in capturing video on the PC.
The video software market will expand with it because it is tied to the market for the hardware. The software for video
browsing, video editing, and video managing needs to become independent of the hardware just as the market for word
processors is completely independent of the platform. It is a question of providing an abstraction of what consumers like to
do with video. The video document is an instantiation of such an abstraction.

A market survey of existing products would be quickly out of date. However, some general statements are possible. Almost
all the products suffer from one affliction or another but, more importantly, some of them exhibit a few of the desirable
traits outlined in this paper.

Some of the current software products suffer from problems that can be seen in Figure 6. No one can complain about an
interface that looks like a TV from the Jetsons cartoon show but the failure to resize to the larger screen is unproductive.
Specialized cursors, filmstrips instead of grids, and modal interfaces are common mistakes found in some of the current
crop of products. The filmstrip paradigm is unproductive for most consumers because the filmstrip does not effectively
use the available screen space. It is inherently one-dimensional. Doubling the width and height of the screen only results
in a filmstrip twice as long while the Grid View will be able to show four times as much content as before.

Figure 6: The typical video capture/edit process does not make good use of the space on a 1600x1200 screen. Video Assets
are managed on the right with only 8 characters for the name. Unique cursors come and go depending on both the mode and
where the mouse hovers. The ubiquitous filmstrip appears at the bottom.

However, despite these faults, there is some evidence that features present in our software are showing up in
commercially available products but each of those features is crippled by the presence of the filmstrip legacy. Using these
software products is like reading a book without a good editor. For instance, there are products that employ shot
detection to create a small grid of video but they do not work on MPEG files. There are MPEG editors but they have no shot
detection so manual intervention is required to define boundaries. There are equivalents to the Asset Manager but they have
no sort capability. There are ways to create MPEG files from multiple originals but they are slow (since they need
decompression and recompression), hard to learn, and often modal.
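
Returning to the filmstrip-versus-grid comparison, the arithmetic can be made explicit with a small Python sketch. It
assumes that doubling the screen size means doubling both width and height, and it uses a hypothetical key frame size of
80x60 pixels; the function names are illustrative only.

```python
def filmstrip_capacity(screen_w: int, thumb_w: int) -> int:
    """Key frames that fit in a single horizontal strip."""
    return screen_w // thumb_w


def grid_capacity(screen_w: int, screen_h: int, thumb_w: int, thumb_h: int) -> int:
    """Key frames that fit when the whole screen area is used as a grid."""
    return (screen_w // thumb_w) * (screen_h // thumb_h)


# Going from an 800x600 screen to a 1600x1200 screen with 80x60 key frames:
strip_small, strip_large = filmstrip_capacity(800, 80), filmstrip_capacity(1600, 80)
grid_small, grid_large = grid_capacity(800, 600, 80, 60), grid_capacity(1600, 1200, 80, 60)
```

Under these assumptions the filmstrip grows from 10 to 20 key frames, while the grid grows from 100 to 400.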

Nowhere is the design as complete as the one proposed here and, even if our software is unsuccessful, it may help to define
what good video software should do. Most of the software available comes packaged with a hardware solution that
incorporates features of the capture device into the interface. This is like selling a keyboard with a word processor that only
works with that keyboard. This paper is an attempt to create an abstraction of what this new medium of video on the PC

                 7. INTERNET VIDEO
The key problem with video on the Internet is bandwidth. To play streamed video, the client first downloads a number of
buffers from the streaming server and plays them while more buffers are downloaded. This prevents running out of buffers
and stalling the video. The wait may be 30 seconds or more before the first frame of video is seen.

Worse, there is no way for the viewer to go faster or jump around in the stream because the only buffers queued up are
those that would be used sequentially. Once consumed, the buffers are tossed. If a second viewing of the video is
requested, the streaming server will queue the buffers and transmit the video again from the beginning with the same
delays as before.

Figure 7: A grid of pictures in a web page that allows selecting where to begin the video.

One alternative is a grid view full of key frames in an HTML page (Figure 7). This would allow the user to select where to
begin viewing. The user has still frames in the grid to look at while waiting to download buffers. In addition, the grid
provides a visual cue that shows what has just been streamed to the client PC. Once the video has been viewed, the user can
select only that portion of the video that they want to see again.
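
Such a page can be generated from the shot list. The Python sketch below is only an illustration: it assumes that each
shot is described by a key frame image and a start time, and that the streaming server understands a hypothetical start
query parameter that tells it where to begin transmission. Neither the parameter nor the function name grid_page comes
from our software.

```python
from html import escape
from typing import List, Tuple

Shot = Tuple[str, float]  # (key frame image URL, shot start time in seconds)


def grid_page(video_url: str, shots: List[Shot], columns: int = 4) -> str:
    """Return an HTML table in which each key frame links to its shot."""
    rows = []
    for i in range(0, len(shots), columns):
        cells = []
        for image_url, start in shots[i:i + columns]:
            # Hypothetical convention: "?start=" selects the starting second.
            href = f"{video_url}?start={start:g}"
            cells.append(
                f'<td><a href="{escape(href)}">'
                f'<img src="{escape(image_url)}" alt="Shot at {start:g} s"></a></td>'
            )
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

The key frames are small still images, so the grid itself loads quickly and gives the viewer something to inspect while
the selected portion of the video is being buffered.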

The grid interface does not solve all the problems of streaming video. First, the video must still be buffered so there are
delays no matter where viewing starts. In addition, unless the web page displays some additional text or graphics, a single
frame may not be sufficient help to the viewer trying to determine where to start the video.

Nonetheless, streaming video from web servers is on the rise (typically movie trailers) and a grid view in a web page would
give the user more control over what they see while at the same time reducing the bandwidth demands on
the network.

                                                   8. CONCLUSION
What conclusions can we draw now that we have experimented with this software for most of a year? First, it is clear that
some of these attempts to create new formats for the presentation and use of video on the PC should receive further
experimentation, preferably in the marketplace or some format which enlists consumer input. The market potential for a
successful tool is enormous because so many people have PCs and so many will have video on those PCs in the coming
years. Video on the PC is a subject that has universal appeal.

The behavior of real people using video on the PC is a sociological question that properly should be answered by studies and
high-volume market surveys. It is possible for us to present data describing how many individuals preferred one format or
feature to another but engineers do not make the best marketing analysts. Such a market survey is beyond the scope of this
paper which is intended only to present the alternatives.

A better methodology is available to each of us through introspection and experimentation. This has been the approach used
in this paper: to design a tool that is intended for use at home with our own home videos. The feedback loop is similar to
that of a tinkerer who will keep trying until it is right.

Even without any consumer surveys or experimental data, some conclusions stand out as potentially valuable contributions to
the use of video on the PC. The first conclusion is that shot detection is vital to whatever process follows. The grid of key
frames provides a scaffold to support other structures that may be built. A second conclusion is that the grid view, with a
healthy respect for the use of screen space, is an efficient way to present video. It provides the mechanism for
random access as well as a quick way to sense what is in the video content.

It remains to be seen if building on the metaphor of word processing is useful to consumers. Will edit, cut, and paste be
important interactions with video? Will consumers just want to watch the video or will random access matter? The behavior
of frustrated channel surfers is an indication that more interactions with video content are desirable and should be pursued.

Regardless of what happens, it is important to jettison the metaphor of film and treat the PC as a new kind of medium for
video. The video document concept represents a historically unencumbered way of pursuing the future. The features of a
video document that are most likely to be embraced are:

     •   Shot detection provides a simple way to organize video content.
     •   A grid presentation of key frames is an efficient overview and creates the mechanism for random access.
     •   The word processing metaphors of edit, cut, and paste build on existing, well-understood uses of the PC.
     •   MPEG files are the medium for exchange to save disk space.
     •   Seamless playback while editing is an essential WYSIWYG feature.
     •   The ability to create new MPEG files from edited views is essential for sharing to other venues.
     •   Video content needs some form of asset management, preferably sorted in a variety of ways.
     •   Individual frames can be searched, copied, and pasted elsewhere.
     •   A self-installing playback tool should be provided for CD-ROM images for sharing to other PCs.
     •   The date on the video should be automatically captured, either via OCR or directly from the DV.

The above features are viewed as essential to enhancing the use of video on the PC. The following features of the video
document are less convincing and need further exploration and experimentation:

     • Some methods for creating an abstract can be useful for exporting to other playback environments.
     • The anti-jitter playback of poor quality video is desirable but a more satisfying result should still be attempted.

Some other areas for further work include:

     • We completely neglected titling in our implementations. This was a practical choice guided by the demands of
       research. It is reasonable to expect that users will want to place titles in their videos.
     • Providing some transitions and special effects would mark time changes or editing cuts. Abstracts would benefit
       from the visual cue of a transition effect.

                                                    9. REFERENCES
1.   J. S. Boreczky and L. A. Rowe. Comparison of Video Shot Boundary Detection Techniques. In Storage and Retrieval
     for Still Image and Video Databases IV, Proc. SPIE 2664, pp. 170-179, Jan. 1996.
2.   A. Dailianas, R. B. Allen, and P. England. Comparison of Automatic Video Segmentation Algorithms. In Integration Issues
     in Large Commercial Media Delivery Systems, Proc. SPIE 2615, pp. 2-16, Oct. 1995.
3.   A. Hampapur, R. C. Jain, and T. Weymouth. Production Model Based Digital Video Segmentation. Multimedia Tools
     and Applications, Vol. 1, No. 1, pp. 9-46, Mar. 1995.
4.   R. Lienhart. Comparison of Automatic Shot Boundary Detection Algorithms. In Storage and Retrieval for Image and
     Video Databases VII, SPIE Vol. 3656, pp. 290-301, Jan. 1999.
5.   B.-L. Yeo and B. Liu. Rapid Scene Analysis on Compressed Video. IEEE Transactions on Circuits and Systems for
     Video Technology, Vol. 5, No. 6, pp. 533-544, December 1995.
6.   R. Zabih, J. Miller, and K. Mai. A Feature-Based Algorithm for Detecting and Classifying Scene Breaks. Proceedings
     ACM Multimedia 95, San Francisco, CA, pp. 189-200, Nov. 1995.
7.   H. Li, D. Doermann and O. Kia. Automatic Text Detection and Tracking in Digital Video. IEEE Trans. on Image
     Processing. To appear.
8.   R. Lienhart. Automatic Text Recognition for Video Indexing. In Proceedings of ACM Multimedia '96 (Boston,
     MA, USA, November 11-18, 1996), pp. 11-20, November 1996.
9.    R. Lienhart and W. Effelsberg. Automatic Text Segmentation and Text Recognition for Video Indexing. Technical
      Report TR-98-009, Praktische Informatik IV, University of Mannheim, May 1998. To appear in ACM/ Springer
      Multimedia Systems Magazine.
10.   A. K. Jain and S. Bhattacharjee. Text Segmentation Using Gabor Filters for Automatic Document Processing. Machine
      Vision and Applications, Vol. 5, No. 3, pp. 169-184, 1992.
11.   T. Sato, T. Kanade, E. Hughes, and M. Smith. Video OCR for Digital News Archives. IEEE Workshop on Content-
      Based Access of Image and Video Databases (CAIVD'98), Bombay, India, January, 1998.
12.   V. Wu, R. Manmatha and E. M. Riseman. Finding Text in Images. In Proceedings of Second ACM International
      Conference on Digital Libraries, Philadelphia, PA, pp. 23-26, July 1997.
13.   R. Lienhart and B.-L. Yeo. Automatic Abstraction of Home Video Footage into Shorter Video. Submitted to IEEE
      Transactions on Multimedia.


  • 1. The Video Document Bob Davies, Rainer Lienhart, and Boon-Lock Yeo Intel Corporation Microcomputer Research Lab 2200 Mission College Blvd. Santa Clara, CA 95052 - 8119 {Bob.Davies, Rainer.Lienhart, Boon-Lock.Yeo} ABSTRACT The metaphor of film and TV permeates the design of software to support video on the PC. Simply transplanting the non- interactive, sequential experience of film to the PC fails to exploit the virtues of the new context. Video on the PC should be interactive and non-sequential. This paper experiments with a variety of tools for using video on the PC that exploits the new context of the PC. Some features are more successful than others. Applications that use these tools are explored, including primarily the home video archive but also streaming video servers on the Internet. The ability to browse, edit, abstract and index large volumes of video content such as home video and corporate video is a problem without appropriate solution in today’ market. The current tools available are complex, unfriendly (professional) video editors, requiring hours of work to s prepare a short home video, far more work that a typical home user can be expected to provide. Our proposed solution treats video like a text document, providing functionality similar to a text editor. Users can browse, interact, edit and compose one or more video sequences with the same ease and convenience as handling text documents. With this level of text-like composition, we call what is normally a sequential medium a “video document”. An important component of the proposed solution is shot detection, the ability to detect when a shot started or stopped. When combined with a spreadsheet of key frames, the shots become a grid of pictures that can be manipulated and viewed in the same way that a spreadsheet can be edited. Multiple video documents may be viewed, joined, manipulated, and seamlessly played back. 
Abstracts of unedited video content can be produced automatically to create novel video content for export to other venues. Edited and raw video content can be published to the net or burned to a CD-ROM with a self-installing viewer for Windows 98 and Windows NT 4. 0. Keywords: Video grid, shot grid, video editor, MPEG-native editor, home video archive, shot detection, video abstraction, video/audio content analysis. 1. INTRODUCTION There are times when the past encumbers what we do and the way we think. In the evolution of the automobile, the early car was simply a carriage with a motor. We are at a similar junction with video on the PC. The metaphor of film permeates the design of video software for the PC. Software products to manage and edit video on the PC are making us ride in a carriage when better designs should be made available to the consumer. Video on the PC is not film. Film is a non-interactive, sequential, shared experience while video on the PC is interactive, randomly accessible, and typically, unshared experience. Transplanting the experience of film onto the PC without significant change in design will not leverage the advantages of the PC. The PC is a tool that can provide a different experience. PC’ are good at providing random access whether searching for s medical information on the web or finding a recent email. It is meant to be interactive and responsive to requests from what is typically one user. Using the PC for video without interactivity would only make it an expensive VCR. Video on the PC is a relatively new phenomenon, especially at the consumer level. The capture devices are only now becoming widely available and we can start experimenting. There are 2 questions these experiments can help answer: 1. What is the information to be organized? 2. Can the content benefit from an interactive presentation?
  • 2. In the rest of the introduction our attention will focus the first question while we will address the second one in the rest of the paper. So what kind of videos qualifies them naturally as a playground? Movies and TV shows are probably not useful to be used on the PC because the content is strongly oriented towards a stream of shared experience. Jumping around inside “Saving Private Ryan” would probably not enhance the viewing experience. The PC simply does not have much to do other than playing the video. There might be exceptions such as skipping commercials or replaying sporting events. Jumping around the tape of a baseball game to find the action, repeating the important plays, or single stepping through the controversial plays, might significantly enhance the viewing experience. A second choice might be educational materials. Imagine a web site with video showing origami (paper folding) for a variety of objects. You would select the swan or balloon that you would enjoy learning how to create and watch just that section. You would be able to refer to it again as you struggle with the different folds. However, there is not a lot of such material available. In addition, the integration of the content with the presentation is an important design element that might yield unique solutions useful only to, say, origami. Both film and educational materials suffer from an additional conflict: ownership. Finding a novel presentation of copyrighted material could be enormously frustrating. A technical problem solved without a legal context is an invitation to litigation. What content is available for the PC? Since about the mid-1980’ consumer video cameras have been widely available. s Parents have bookshelves full of videotapes from nearly every one of their child’ events, from birthdays to sporting events. s Unfortunately, the content of this bookshelf often goes unused. 
A tape labeled with the location of a recent vacation might also have the winning goal from a soccer game. A skiing trip can be on the end of one tape and on the beginning of another. Worse, sorting and organizing tapes are almost never done so even finding a tape from the right time period is difficult. Also, once the tape is found, finding the right place on the tape is troublesome, especially since there is no certainty about the contents. Home video may not be the most exciting video content but it is a good place to experiment with what the PC can do. A solution to the problem of organizing home videos will provide a good metaphor for other categories such as organizing news footage for a local TV station or providing video reference material in a library. Our approach treats video like a text document, providing functionality similar to a text editor. The rest of this paper presents in detail our design of “video document” and several experiments we conducted to explore the variety of things that a PC can do with video. 2. TOOLS Before discussing the experiments with a home video archive or any other application, a short summary of the enabling tools will serve to orient developers to follow. The target platform is Windows 98tm and Windows NTtm, primarily to allow the software to be widely shared with colleagues. The programming tools were Microsoft Visual Basictm for the user interface and a C++ compiler for the component libraries. The justification for using Visual Basic was to make it simple to try a design, throw it out and try again. The studied minimalism of code written in VB makes this possible. Of all the components used, none is more important than MPEG Processing Library (MPL), a general purpose and optimized MPEG processing software infrastructure developed in our lab. The library provides random access to any frame and was considerably faster than all commercially available libraries. 
The MPL component also allowed the creation of necessary methods, properties, and events to make the experimentation easier. Shot detection (discussed later) is a key component to the process of creating visual overview of video on the PC. Shot detection is often the first step after capturing video because it provides the skeleton for subsequent views. Common to almost all presentations is the use of a grid control that allows cells in the grid to contain pictures. VideoSoft’s VSFlextm control is one such control that is commercially available. Other controls are available which can provide a similar service. Databases are not more efficient than simple flat files. When used to manage video, the database need only contain a few numbers such as the starting frame number, ending frame number, and the representative key frame. However, by using
  • 3. database technology, the meaning of the individual fields is transparent and self-documenting. Anyone can understand your definitions using a variety of database programs. More important, adding new fields for different experiments such as allowing the user to associate a description with any shot is easily handled while maintaining backward compatibility. The specific database tool employed is Microsoft’ Data Access Objectstm (DAO.) Microsoft’ Accesstm is the tool used to verify s s the contents of the database as well as expose field names and definitions. Installation and setup are important steps in the proliferation of any software tool. Developing the setup script while developing the software is a simple empirical method to ensure that software gets into the hands of real users. Great products have been trapped inside poorly managed setup scripts. A number of commercially available packages were evaluated and all suffered from the same deficiencies – unique scripting languages, quirky user interfaces, and a failure to make the installation process transparent. Writing the setup script in a widely available language does more than make the syntax familiar. It makes the process transparent and the complexities of installation in a Microsoft environment approachable. A simple bootstrap program ensures that all necessary components of the Visual Basic environment are present. The VB setup program takes control, allowing a flexible user interface with the familiar wizard metaphor. Serious effort was put into making the install process as fast as possible while avoiding a reboot as often as possible. 3. VIDEO CAPTURE AND PLAYBACK The purpose of this paper is to focus on what happens after the video has been captured. However, capture has to be done to make the content available as an MPEG-1/2 file on the PC. Once a video is captured it is desirable to always preserve the original. 
In our scheme, all changes to the content like insertion, deletion, and transitions do not alter the original, only the way it is played. Alternative views or abstractions may be created in a future examination of the videos but the originals will remain intact. Some portion of the video that is not interesting now could prove invaluable later. Video is a lot for most PC’ to digest at this point, and eliminating copies of even portions of a video is desirable and could be desirable even when s it is no longer necessary. Critical to this efficiency of keeping the original copy is seamless playback of edited video. Without seamless playback, viewer/editor disparities can emerge. If the playback of a video while editing were different from the playback when simply viewing the video, it would no longer be a WSIWYG solution. MPL provides real-time access to individual I, P, and B frames by means of the creation of an index file. 4. VIDEO DOCUMENT 1. Shot View The first step in building a video document on the PC is shot detection. A shot is a video sequence recorded by a camera’ s uninterrupted operation. For instance, if the video contains a baby walking, then breaks, and then shows the baby sitting in a high chair eating, that break would define a shot boundary. The camera was stopped, pointed at something new, and started again. In principle, shot boundaries can be detected automatically in two ways: Either there is some metadata attached with the video that allows finding shot boundaries or the shot boundaries must be deduced by analyzing the visual stream. In the case of digital video (DV) the cameras encode the date and time of recording with the video stream, which makes it possible for a computer to find any break in the time line easily. However, on conventional analog video cameras, this information either does not exist or is not externally available. If the video is MPEG encoded by an external device, any time information even in DV is gone. 
Therefore, shot detection is performed on the visual stream. Shot detection is a widely researched area [1-6]. We use the hard cut detection algorithm proposed in [5] and the fade detector in [4] to determine shots. Since both detectors work on only the DC coefficients of the MPEG video, our shot detector runs 19 times faster than real-time on a Pentium-II 450 MHz. Once the shots have been detected, a key frame from the middle of the shot is used as a representative of the shot. Preparing a grid full of these key frames is a natural way to provide random access to any shot within a video and is called the Shot View (see Figure 1.) Some natural ways of using a grid fall out from showing the Shot View. Clicking on a frame will open it in a full-size video player. Double-clicking any key frame will play the shot starting at that location. Because the grid contains many small images, it is useful to put several players at the bottom of the screen. Clicking an image readies one of the players at that frame and enlarges the frame to full size. This allows closer inspection before playing. Additionally, similar images can be compared when both are enlarged in separate players.

The size of the key frames in the grid is variable. Some videos benefit from smaller images so that more can appear on the screen, which makes scanning and scrolling easier. If more detail is desired, the key frames can be enlarged.

One of the key features of the grid format is that it does not matter whether you are viewing the video or editing it. The grid defines a mechanism for random access that is useful in both contexts. For purposes of product definition, it may be useful to have both an editor and a viewer, much like the Adobe Acrobat model. For Acrobat, the reader is freely distributed and the editor to create Acrobat documents is sold to a narrow market. Figure 1 shows the reader, while Figure 2 shows the editor.

Figure 1: The basic grid of key frames, highlighting the selected frame. In addition, the screen width permitted 2 player windows at the bottom. Click on an image to display it full-size. Double-click to play back the shot.

2. Shot Detection Problems

One of the first problems encountered when writing a shot detection component for home video is flash photography. Flashes are very common in home video and dramatically change the color of the entire contents of the shot. This, in general, results in a shot boundary for most detection algorithms unless the software either restricts shot boundary detection to hard cuts and fades or explicitly searches for shot boundaries caused by flashes.

In the first case, flashes are simply not detected: they usually cause a double spike in the sequence of color histogram differences of video frames, which violates the definition of a hard cut, and a double spike is too short to be a fade. It is reasonable to assume that raw video footage does not contain other types of transitions. In the second case, it could be checked whether
• the sequence of color histogram differences across a shot boundary exhibits a double spike,
• a participating frame possesses a significantly higher average brightness than its two delineating frames, and
• the two delineating frames' color contents are very similar.

Another problem with shot detection is the selection of the key frame. For our purpose it is tempting to have the first frame of any shot be the key frame that represents the shot. However, the first frame or even the first few frames are often transitions or fades, or are mangled by a poor-quality video camera. As a compromise, the middle frame is displayed as the representative of the shot. The interface allows the user to override the default and make any memorable frame the representative. Alternative algorithms to select a key frame automatically should be explored. One algorithm might be to take not the first frame but a frame some small fraction of the way from the beginning. A more expensive solution might be to average the color of all frames in a shot and find the frame closest to the average color.

Figure 2: A Time View showing frames at 1-second intervals. Rather than sequentially searching a video for the moment when the baby was whistling, the Time View allows you to find the frame at a glance. Also shown in this version is the Asset Manager (left side) to manage the list of videos on the machine.
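
The three flash checks listed in the section on shot detection problems can be sketched as a single test on a candidate frame and its two neighbors. This is a minimal illustration under stated assumptions: frames are flat sequences of gray values rather than color images, the flash lasts exactly one frame, and the three thresholds (`spike`, `brighter`, `similar`) are invented values that would need tuning on real footage.

```python
from typing import List, Sequence

Frame = Sequence[int]  # assumption: a frame as a flat sequence of 8-bit gray values


def histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Normalized gray-level histogram of one frame."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // 256] += 1
    return [c / len(frame) for c in counts]


def histogram_difference(a: Frame, b: Frame) -> float:
    """Sum of absolute bin differences, between 0.0 and 2.0."""
    return sum(abs(x - y) for x, y in zip(histogram(a), histogram(b)))


def mean_brightness(frame: Frame) -> float:
    return sum(frame) / len(frame)


def is_flash(before: Frame, candidate: Frame, after: Frame,
             spike: float = 1.0, brighter: float = 30.0,
             similar: float = 0.2) -> bool:
    """Apply the three checks to a candidate frame and its delineating frames."""
    # 1. Double spike: a large difference entering and leaving the candidate.
    double_spike = (histogram_difference(before, candidate) > spike
                    and histogram_difference(candidate, after) > spike)
    # 2. The candidate is much brighter than both delineating frames.
    much_brighter = (mean_brightness(candidate)
                     - max(mean_brightness(before), mean_brightness(after))) > brighter
    # 3. The two delineating frames look alike.
    neighbors_match = histogram_difference(before, after) < similar
    return double_spike and much_brighter and neighbors_match


scene = [90] * 64        # ordinary frame
flash = [250] * 64       # the same scene washed out by a flash
new_scene = [200] * 64   # a different, brighter scene after a real cut

flash_found = is_flash(scene, flash, scene)
cut_found = is_flash(scene, new_scene, new_scene)
print(flash_found, cut_found)
```

A detector could run this test wherever a cut candidate is found and discard the boundary when it returns true, so that a flash does not split a shot.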

3. Searching the Video

The conventional metaphor for playing video is the series of play, pause, and stop buttons seen on most consumer audio devices. Beyond that conventional metaphor, Figure 1 shows some buttons that allow the user to jump to the previous shot, previous frame, next frame, and next shot. The ability to play the video fast backward and fast forward allows the user to search a sequence. Searching backward and forward is a common means of looking at videotape and is present on many consumer devices. Using fast forward to find a specific frame on the PC is quite common. However, searching backward is not present in any commercially available software player or editor. Our fast backward capability, which is enabled by MPL's support for backward frame access in MPEG video, fills a hole left in most mechanisms for searching video on the PC.

4. Time View

A Time View (Figure 2) is a series of frames selected at a specific time interval, e.g. one frame for every two seconds. The Time View can provide a simpler way to search than conventional playback because many frames can be viewed at once. The mosaic of frames offers the same ability to click, double-click, or fast forward. More precision can be obtained by selecting a shot or a series of shots and then opening a Zoom View, which is simply a Time View for the selected time. A Zoom View can be thought of as a Time View with finer granularity.

5. Asset Management

Video files are large, and keeping track of them is important because they represent a high percentage of the disk usage. Asset management is an attempt to collect a meaningful view of all the video available to the system. There are two parts to asset management, the Asset Manager and the Asset Summary (see Figure 3). The Asset Manager is always visible on the left side of the screen. Each asset in the view shows the principal name of the video.
This tree view is used to select which video to work on or which view of the video is to be opened.

The Asset Summary is a grid containing all video content on the PC. It can be sorted in a variety of ways by clicking on the title of the respective column. The Asset Summary can be used to find a video from a particular date or with a specific description or category. It is difficult to predict which variables will be important to keep in the database, but month and year are obvious candidates, although they are often found in the title as well. Putting the month and year in the grid allows the videos to be sorted by date. A description is potentially useful and should be there. Category is an attempt to think ahead to the time when there are so many videos that further categories will be needed. Any field in the grid can be edited just like a spreadsheet, and a number of them have pulldown menus to make the edits easier.

Figure 3: The Asset Manager along the left side is used to select videos to work on or to open a view of a particular video. The Asset Summary in the work area of the form is an editable grid displaying all the information in the database about each video.
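
The sortable Asset Summary can be pictured as a list of records with one field per column. The sketch below is only an illustration: the record layout follows the columns named in the text (title, month, year, description, category), but the class name, the field names, and the sample entries are assumptions, not the schema of the actual database.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Asset:
    """One row of the Asset Summary grid (field names are assumptions)."""
    title: str
    year: int
    month: int
    description: str = ""
    category: str = ""


def sort_by_column(assets: List[Asset], column: str,
                   descending: bool = False) -> List[Asset]:
    """Return a new list ordered by one column, as a column-title click would."""
    return sorted(assets, key=lambda a: getattr(a, column), reverse=descending)


def sort_by_date(assets: List[Asset]) -> List[Asset]:
    """Order by year, then month, so the date sort is chronological."""
    return sorted(assets, key=lambda a: (a.year, a.month))


# Hypothetical library contents.
library = [
    Asset("Beach trip", 1998, 7, "Vacation at the coast", "Travel"),
    Asset("First steps", 1999, 2, "Baby walking", "Family"),
    Asset("Birthday", 1997, 11, "Party at home", "Family"),
]

by_title = [a.title for a in sort_by_column(library, "title")]
by_date = [a.title for a in sort_by_date(library)]
print(by_title)
print(by_date)
```

Keeping month and year as separate numeric fields is what makes the chronological sort a simple two-part key, which matches the reason the text gives for putting them in the grid.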

6. Edit, Cut, Copy, and Paste

Accessing any frame in a video can be more useful if that frame can be pulled out of the video and copied elsewhere. The image can be placed in an email or posted to the web. Frame-accurate control is possible using the frame selection controls (the plus (+) and minus (-) signs in the toolbar). The image is copied to the clipboard in either JPEG or Windows Bitmap format.

The Copy function was the first and simplest to implement, but the comparison with a word processor only began there. The Copy function simply places an image and its associated temporal extent on the system clipboard for a later paste. Selecting a frame and then specifying the Cut function removes that shot or frame sequence from the view. If more than one frame is selected, then more than one is cut from the document. Selecting more than one key frame in the grid and using the Copy function makes that series of frames available to a Paste function. That series of shots can then be placed into any location in the video document. Anyone familiar with a word processor or a spreadsheet will have no trouble with the concepts.

Figure 4 shows the three steps of highlighting a sequence of five shots from a video sequence, cutting out the segment, and finally pasting the five shots at the end of the sequence. This highlight-cut-paste sequence on shots mirrors the analogous operations on text. Figure 5 illustrates the method employed by video editing software: a timeline is used for the creation of a new video sequence. Segments of video are selected and dropped onto the timeline. Our video document approach allows editing to be made on the existing view of the video in the same fashion as text editing is performed, whereas the present video-editing paradigm requires the creation of a new sequence based on segments of existing ones.

The edit functions also work when using more than one video document.
For instance, a series of frames in one document may be copied to the clipboard and then pasted into another video document. When the combined video document is played, the video player seamlessly switches between videos during playback. To anyone watching the video, it will appear that the two videos are part of a single MPEG file. When the combined video document is saved, it is not saved as a single MPEG file. Instead, the database for the video asset is updated to reflect the new series of shots with their respective video file names. The typical table in the database for such a video document is small, usually less than 10 KB.

Figure 4: The steps of highlighting a sequence of shots, cutting the highlighted segments, and pasting the segments at the end of the sequence. Editing is done directly on the current sequence.
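
The non-destructive editing model described in this section amounts to keeping an ordered list of shot references and editing only that list. The following sketch shows the idea under stated assumptions: `ShotRef`, `VideoDocument`, their methods, and the file names are hypothetical names chosen for illustration, and the real system stores the list in a database table rather than in memory.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShotRef:
    """Reference to a frame range in a source file that is never modified."""
    file_name: str
    first_frame: int
    last_frame: int


class VideoDocument:
    """An ordered list of shot references; edits change the list, not the video."""

    def __init__(self, shots: List[ShotRef]) -> None:
        self.shots = list(shots)

    def copy(self, start: int, count: int) -> List[ShotRef]:
        return self.shots[start:start + count]

    def cut(self, start: int, count: int) -> List[ShotRef]:
        clip = self.copy(start, count)
        del self.shots[start:start + count]
        return clip

    def paste(self, position: int, clip: List[ShotRef]) -> None:
        self.shots[position:position] = clip

    def playlist(self) -> List[Tuple[str, int, int]]:
        """What a player would walk through, switching files between shots."""
        return [(s.file_name, s.first_frame, s.last_frame) for s in self.shots]


# Hypothetical documents built from two source files.
birthday = VideoDocument([
    ShotRef("birthday.mpg", 0, 99),
    ShotRef("birthday.mpg", 100, 249),
    ShotRef("birthday.mpg", 250, 399),
    ShotRef("birthday.mpg", 400, 599),
])
vacation = VideoDocument([
    ShotRef("vacation.mpg", 0, 149),
    ShotRef("vacation.mpg", 150, 299),
])

# Cut the two middle shots and paste them at the end of the same document.
clipboard = birthday.cut(1, 2)
birthday.paste(len(birthday.shots), clipboard)

# Copy the first shot of another document and paste it after the first shot.
birthday.paste(1, vacation.copy(0, 1))

files_in_order = [name for name, _, _ in birthday.playlist()]
print(birthday.playlist())
```

Saving such a document only requires writing the list of references, which is consistent with the small table size reported in the text.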

Figure 5: The current video editing approach: creating a new video sequence on a timeline view. Editing is performed by the creation of a new sequence.

7. MPEG Native Editor

Playing an edited video with the MPL library always uses the original video content for playback and manipulates it on the fly (it is fast enough to produce a seamless video stream). However, there are a variety of other hardware and software decoders and other DirectShow applications that the consumer might want to use. Our video editor needs to be able to export MPEG files to these other applications. The MPEG native editor is provided for that purpose.

The MPEG native editor is built with a precise knowledge of the MPEG standard and maintains the bit rate of the original and the series of timecodes in the MPEG stream. There are plenty of video editors available that handle MPEG editing; however, nearly all of them require decompression and recompression of the whole edited video. In contrast, the MPEG native editor just copies those GOPs that are untouched by the editing. Only the GOPs affected by new shot boundaries need decompression and recompression. This reduces the computation time significantly.

8. Anti-Jitter Playback

Working with a variety of home videos means encountering some pretty poor hand-held camera work. Rapid pans, rapid zooms, and an inability to hold the camera steady make viewing the video difficult if not nauseating. Many modern cameras have image stabilization features that reduce jitter considerably. However, rapid pans and zooms cannot be eliminated even with image stabilization. In addition, many videos have already been created without image stabilization. Therefore, an anti-jitter playback feature has been added to the software specifically to handle these problems. Anti-jitter playback works by giving the user control of the time between frames.
One way to interpret it is to say that it converts the video to a slide show with the original audio playing. Allowing the user to focus on an image that is on the screen for one or more seconds reduces the nausea. If the jitter is low, the images may be updated more frequently. If it is high, the image may be held for several seconds. The resources required for anti-jitter playback are less than for normal playback because far fewer frames are actually displayed. By doing less work, otherwise unviewable video can be seen and heard. For the future, we also plan to fight jitter in the narrow sense (i.e., no rapid pans and zooms) by cropping multiple frames in time so as to keep the “image” stable.

9. CD-ROM Image

Documents are meant to be shared. The same is true for video documents. Whatever you have created (the different edited versions of a video, its video abstracts, and meta-information), you should be able to pass it on to other people. Therefore, we provide the capability to burn CD-ROMs that include the MPEG file, the different views, the meta-information, and self-installing playback software. This playback software is strictly a viewer that presents a grid of video (based on the results of the shot detection) and allows random access. It does not include the edit, cut, and paste features or the Asset Manager. Once you have inserted the video CD-ROM into your computer, it will start the viewer application automatically.
One positive side effect of creating CD-ROM images of video documents is that it provides an easy and cheap backup solution for the large MPEG files.

10. Date Capture

It would be nice if the date of every video were captured along with the video. Digital video recorders have the date encoded within the video stream, but once the video is captured (usually in an external capture device), there is no provision to carry the date in the MPEG format. The simple solution is to force the user to provide the month and year for every video that is added to the video asset. There is another solution, but it requires a great deal more processing (about 10 minutes for a 1-hour MPEG-1 video on a 450 MHz Pentium II). Most home video cameras have a provision to place a month, day, and year on the video as it is being captured. This date can be obtained automatically by some unconventional Optical Character Recognition (OCR) techniques [7-12].

Conventional OCR would take a single frame and try to isolate the characters by finding the edges and performing some horizontal and vertical transformation to create projections that are looked up in a table. However, OCR on a video frame is not the simple black and white problem that appears when working on the printed page. The letters are typically white or light gray, but the area behind the letters may be snow, making the letters virtually invisible. The technique used here is a simplified version of the one described in [11]. It takes advantage of the fact that the date's location is fixed over time and that the date's color is very bright. For every shot, the frames are stacked one over the other and the minimum pixel intensity over time is calculated for each pixel position. As a result, only date text pixels keep their brightness and can therefore be extracted by simply thresholding the combined image and removing small regions that do not meet the geometric requirements of characters.
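
The temporal-minimum step just described can be shown on a toy example. The sketch assumes tiny gray-level frames stored as nested lists and an invented threshold of 200; it covers only the per-pixel minimum and the thresholding, and leaves out the removal of small regions and the character recognition itself.

```python
from typing import List

Image = List[List[int]]  # assumption: rows of 8-bit gray values


def temporal_minimum(frames: List[Image]) -> Image:
    """Per-pixel minimum intensity over all frames of a shot."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[min(frame[y][x] for frame in frames) for x in range(width)]
            for y in range(height)]


def threshold(image: Image, level: int = 200) -> Image:
    """Binary mask: 1 where the pixel stayed bright in every frame."""
    return [[1 if value >= level else 0 for value in row] for row in image]


# Three 3x4 frames. The overlay is the bright pair in the bottom row, which
# never moves. Other bright pixels appear in single frames only.
frames = [
    [[10, 240, 30, 40], [50, 60, 70, 80], [90, 250, 250, 100]],
    [[230, 20, 30, 40], [50, 245, 70, 80], [90, 250, 250, 235]],
    [[10, 20, 225, 40], [50, 60, 70, 240], [90, 250, 250, 100]],
]

combined = temporal_minimum(frames)
mask = threshold(combined)
print(mask)
```

Bright background such as snow survives the minimum only if it stays bright at the same pixel throughout the shot, which is why the text follows the thresholding with a geometric filter on the remaining regions.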
In order to improve the quality of the extracted date character bitmaps, all images are scaled up by a factor of four using cubic interpolation before applying any operations. Next, we utilize the fact that the date text has a very simple structure such as “MM DD YYYY” or “HH:MM:SS” in order to correct falsely recognized characters, fill in missed characters, and judge whether the recognition is correct or needs manual revision. On 5.5 hours (676 shots) of home video from analog tapes, 96% of all dates were identified and judged correct, 3.1% were marked as needing manual resolution, and 0.9% were recognized incorrectly.

5. VIDEO ABSTRACTS

Recording home videos with camcorders is much more popular than playing them back. This difference is due to the fact that unedited home video footage is usually long-winded, lacks visually appealing effects, and thus tends to be too time-consuming and boring to watch. However, most people do not have time to edit their videos, and even if they did, the resulting video would be inflexible and could not adjust to the viewers' needs. For instance, if some friends visit, it is very likely that you only want to show them a 15-minute excerpt of your last vacation in order not to bore your guests, while when your parents visit, a longer excerpt could be tolerable. This cannot be done easily with current systems. However, a system capable of abstracting raw video into shorter video automatically in real-time could give a user that flexibility. It would easily generate video abstracts satisfying the individual time constraints.

How to create a video abstract from a home video is described in other papers [13]. The important point here is that a short abstract can be created within a few seconds for any number of videos. There are two types of video abstracts created with our software.
Both methods require that each shot be cut down in a preprocessing step to a short clip of at most ten seconds showing only the most important part. Importance is defined here as the part of the shot that has the largest audio support [13]. Once the shots have been shortened, two different kinds of abstracts can be created.

The first is a simple Random Abstract that chooses clips randomly until the target abstract length is reached. Each time a new abstract is created, it will be different. The length of the abstract is specified along with the list of videos to abstract. A simple request might select five minutes from three videos. The results are presented in a grid, which can be played or replayed any number of times.

The second type of abstract is called Smart Abstract. The difference is that the date of recording (see Section 4) is used to cluster shots into a hierarchy of shot clusters of weeks and days. This hierarchy of shot clusters is then used to create significantly better abstracts than the random abstracts. A Smart Abstract is driven by rules that implement the following objectives in order to create good-quality abstracts:
• Balanced Coverage. The video abstract should be composed of clips from all parts of the source video set.
• Shortened Shots. Commonly, the duration of the raw, unedited shots is too long and the content too long-winded for video abstracts. Moreover, their uncut presence in video abstracts does not offer a balanced coverage of the source video material. Therefore, shots exceeding a maximum length must be cut down to their most interesting parts.
• Random Selection. Due to the nature of home video material, all shots are generally more or less equally important. In addition, individual abstracts of the same source videos should vary each time in order to hold interest after multiple playbacks. Therefore, “controlled” random clip selection should be a key feature of our video abstracting procedure.
• Focused Selection. If the abstraction ratio is high, as is commonly the case, the abstracting algorithm should focus only on a random subset of week clusters and the corresponding day clusters. Thus, a more detailed coverage of selected clusters is preferred over a totally balanced but more superficial coverage.

Currently, there are no attempts to provide visual cues to the transitions between shots in the abstract, but it would be highly desirable. Showing a simple transition such as a door opening or a horizontal wipe gives the viewer a sense that there is a break in time.

It may be argued that creating a video abstract is obsolete when scanning video is so easy using a shot view. Jumping from frame to frame in a grid and playing small portions is the reason for the grid view. However, when creating a short video to share with others who are passively watching, the abstract is demonstrably useful.

6. MARKET COMPARISONS

The market for video capture devices is growing rapidly as more consumers discover utility in capturing video on the PC. The video software market will expand with it because it is tied to the market for the hardware.
The software for video browsing, video editing, and video managing needs to become independent of the hardware, just like the market for word processors is completely independent of the platform. It is a question of providing an abstraction of what consumers like to do with video. The video document is an instantiation of such an abstraction.

A market survey of existing products would be quickly out of date. However, some general statements are possible. Almost all the products suffer from one affliction or another but, more importantly, some of them exhibit a few of the desirable traits outlined in this paper.

Some of the current software products suffer from problems that can be seen in Figure 6. No one can complain about an interface that looks like a TV from the Jetsons cartoon show, but the failure to resize to the larger screen is unproductive. Specialized cursors, filmstrips instead of grids, and modal interfaces are common mistakes found in some of the current crop of products. The filmstrip paradigm is unproductive for most consumers because the filmstrip does not effectively use the available screen space. It is inherently one-dimensional. Doubling the screen size only results in a filmstrip twice as long, while the Grid View will be able to show four times as much content as before.

Figure 6: The typical video capture/edit process does not make good use of the space on a 1600x1200 screen. Video Assets are managed on the right with only 8 characters for the name. Unique cursors come and go depending on both the mode and where the mouse hovers. The ubiquitous filmstrip appears at the bottom.

However, despite these faults, there is some evidence that features present in our software are showing up in commercially available products, but each of those features is crippled by the presence of the filmstrip legacy. Using these software products is like reading a book without a good editor.
For instance, there are products that employ shot detection to create a small grid of video, but they do not work on MPEG files. There are MPEG editors, but they have no shot detection, so manual intervention is required to define boundaries. There are equivalents to the Asset Manager, but they have no sort capability. There are ways to create MPEG files from multiple originals, but they are slow (since they need decompression and recompression), hard to learn, and often modal. Nowhere is the design as complete as the one proposed here, and even if our software is unsuccessful, it may help to define what good video software should do.

Most of the software available comes packaged with a hardware solution that incorporates features of the capture device into the interface. This is like selling a keyboard with a word processor that only works with that keyboard. This paper is an attempt to create an abstraction of what this new medium of video on the PC requires.

7. INTERNET VIDEO

The key problem with video on the Internet is bandwidth. In order to transmit video, a streaming server first downloads buffers to play while more buffers are downloaded. This prevents running out of buffers and stalling the video. The time consumed waiting to start a video may be 30 seconds or more before the first frame of video is seen. Worse, there is no way for the viewer to go faster or jump around in the stream because the only buffers queued up are those that would be used sequentially. Once consumed, the buffers are tossed. If a second viewing of the video is requested, the streaming server will queue the buffers and transmit the video again from the beginning with the same delays as before.

Figure 7: A grid of pictures in a web page that allows selecting where to begin the video.

One alternative is a grid view full of key frames in an HTML page (Figure 7). This would allow the user to select where to begin viewing. The user has still frames in the grid to look at while waiting to download buffers. In addition, the grid provides a visual cue that shows what has just been streamed to the client PC. Once the video has been viewed, the user can select only that portion of the video that they want to see again. The grid interface does not solve all the problems of streaming video.
First, the video must still be buffered, so there are delays no matter where viewing starts. In addition, unless the web page displays some additional text or graphics, a single frame may not be sufficient help to the viewer trying to determine where to start the video. Nonetheless, streaming video from web servers is on the rise (typically movie trailers), and a grid view in a web page would give the user more control over what they see while at the same time reducing the bandwidth demands on the network.

8. CONCLUSION

What conclusions can we draw now that we have experimented with this software for most of a year? First, it is clear that some of these attempts to create new formats for the presentation and use of video on the PC should receive further experimentation, preferably in the marketplace or in some format that enlists consumer input. The market potential for a successful tool is enormous because so many people have PCs and so many will have video on those PCs in the coming years. Video on the PC is a subject that has universal appeal.

The behavior of real people using video on the PC is a sociological question that properly should be answered by studies and high-volume market surveys. It is possible for us to present data describing how many individuals preferred one format or feature to another, but engineers do not make the best marketing analysts. Such a market survey is beyond the scope of this paper, which is intended only to present the alternatives.

A better methodology is available to each of us through introspection and experimentation. This has been the approach used in this paper: to design a tool that is intended for use at home with our own home videos. The feedback loop is similar to that of a tinkerer who will keep trying until it is right.

Even without any consumer surveys or experimental data, some conclusions stand out as potentially valuable contributions to the use of video on the PC.
The first conclusion is that shot detection is vital to whatever process follows. The grid of key frames provides a scaffold to support other structures that may be built. A second conclusion is that the grid view, with a healthy respect for the efficient use of screen space, is an efficient way to present video. It provides the mechanism for random access as well as an efficient way to sense what is in the video content.

It remains to be seen if building on the metaphor of word processing is useful to consumers. Will edit, cut, and paste be important interactions with video? Will consumers just want to watch the video or will random access matter? The behavior of frustrated channel surfers is an indication that more interactions with video content are desirable and should be pursued. Regardless of what happens, it is important to jettison the metaphor of film and treat the PC as a new kind of medium for video. The video document concept represents a historically unencumbered way of pursuing the future.

The features of a video document that are most likely to be embraced are:
• Shot detection provides a simple way to organize video content.
• A grid presentation of key frames is an efficient overview and creates the mechanism for random access.
• The word processing metaphors of edit, cut, and paste build on existing, well-understood uses of the PC.
• MPEG files are the medium for exchange to save disk space.
• Seamless playback while editing is an essential WYSIWYG feature.
• The ability to create new MPEG files from edited views is essential for sharing to other venues.
• Video content needs some form of asset management, preferably sortable in a variety of ways.
• Individual frames can be searched, copied, and pasted elsewhere.
• A self-installing playback tool should be provided for CD-ROM images for sharing to other PCs.
• The date on the video should be automatically captured, either via OCR or directly from the DV.

The above features are viewed as essential to enhancing the use of video on the PC.

The following additional features of the video document are less convincing and need further exploration and experimentation:
• Some methods for creating an abstract can be useful for exporting to other playback environments.
• The anti-jitter playback of poor-quality video is desirable, but a more satisfying result should still be attempted.

Some other areas for further work include:
• We completely neglected titling in our implementations. This was a practical choice guided by the demands of research. It is reasonable to expect that users will want to place titles in their videos.
• Providing some transitions and special effects would mark time changes or editing cuts. Abstracts would benefit from the visual cue of a transition effect.

9. REFERENCES

1. J. S. Boreczky and L. A. Rowe. Comparison of Video Shot Boundary Detection Techniques. In Storage and Retrieval for Still Image and Video Databases IV, Proc. SPIE 2664, pp. 170-179, Jan. 1996.
2. A. Dailianas, R. B. Allen, and P. England. Comparison of Automatic Video Segmentation Algorithms. In Integration Issues in Large Commercial Media Delivery Systems, Proc. SPIE 2615, pp. 2-16, Oct. 1995.
3. A. Hampapur, R. C. Jain, and T. Weymouth. Production Model Based Digital Video Segmentation. Multimedia Tools and Applications, Vol. 1, No. 1, pp. 9-46, Mar. 1995.
4. R. Lienhart. Comparison of Automatic Shot Boundary Detection Algorithms. In Storage and Retrieval for Image and Video Databases VII, SPIE Vol. 3656, pp. 290-301, Jan. 1999.
5. B.-L. Yeo and B. Liu. Rapid Scene Analysis on Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, pp. 533-544, December 1995.
6. R. Zabih, J. Miller, and K. Mai. A Feature-Based Algorithm for Detecting and Classifying Scene Breaks. Proceedings ACM Multimedia 95, San Francisco, CA, pp. 189-200, Nov. 1995.
7. H. Li, D. Doermann, and O. Kia. Automatic Text Detection and Tracking in Digital Video. IEEE Trans. on Image Processing. To appear.
8. R. Lienhart. Automatic Text Recognition for Video Indexing. In Proceedings of the ACM Multimedia '96 (Boston, MA, USA, November 11-18, 1996), pp. 11-20, November 1996.
9. R. Lienhart and W. Effelsberg. Automatic Text Segmentation and Text Recognition for Video Indexing. Technical Report TR-98-009, Praktische Informatik IV, University of Mannheim, May 1998. To appear in ACM/Springer Multimedia Systems Magazine.
10. A. K. Jain and S. Bhattacharjee. Text Segmentation Using Gabor Filters for Automatic Document Processing. Machine Vision and Applications, Vol. 5, No. 3, pp. 169-184, 1992.
11. T. Sato, T. Kanade, E. Hughes, and M. Smith. Video OCR for Digital News Archives. IEEE Workshop on Content-Based Access of Image and Video Databases (CAIVD'98), Bombay, India, January 1998.
12. V. Wu, R. Manmatha, and E. M. Riseman. Finding Text in Images. In Proceedings of Second ACM International Conference on Digital Libraries, Philadelphia, PA, pp. 23-26, July 1997.
13. R. Lienhart and B.-L. Yeo. Automatic Abstraction of Home Video Footage into Shorter Video. Submitted to IEEE Transactions on Multimedia.