
Transcript

  • 1. The Video Document

Bob Davies, Rainer Lienhart, and Boon-Lock Yeo
Intel Corporation, Microcomputer Research Lab, 2200 Mission College Blvd., Santa Clara, CA 95052-8119
{Bob.Davies, Rainer.Lienhart, Boon-Lock.Yeo}@intel.com

ABSTRACT

The metaphor of film and TV permeates the design of software to support video on the PC. Simply transplanting the non-interactive, sequential experience of film to the PC fails to exploit the virtues of the new context. Video on the PC should be interactive and non-sequential. This paper experiments with a variety of tools for using video on the PC that exploit the new context of the PC. Some features are more successful than others. Applications that use these tools are explored, including primarily the home video archive but also streaming video servers on the Internet. The ability to browse, edit, abstract and index large volumes of video content such as home video and corporate video is a problem without an appropriate solution in today’s market. The current tools available are complex, unfriendly (professional) video editors, requiring hours of work to prepare a short home video, far more work than a typical home user can be expected to provide.

Our proposed solution treats video like a text document, providing functionality similar to a text editor. Users can browse, interact, edit and compose one or more video sequences with the same ease and convenience as handling text documents. With this level of text-like composition, we call what is normally a sequential medium a “video document”. An important component of the proposed solution is shot detection, the ability to detect when a shot started or stopped. When combined with a spreadsheet of key frames, the shots become a grid of pictures that can be manipulated and viewed in the same way that a spreadsheet can be edited. Multiple video documents may be viewed, joined, manipulated, and seamlessly played back.
Abstracts of unedited video content can be produced automatically to create novel video content for export to other venues. Edited and raw video content can be published to the net or burned to a CD-ROM with a self-installing viewer for Windows 98 and Windows NT 4.0.

Keywords: Video grid, shot grid, video editor, MPEG-native editor, home video archive, shot detection, video abstraction, video/audio content analysis.

1. INTRODUCTION

There are times when the past encumbers what we do and the way we think. In the evolution of the automobile, the early car was simply a carriage with a motor. We are at a similar junction with video on the PC. The metaphor of film permeates the design of video software for the PC. Software products to manage and edit video on the PC are making us ride in a carriage when better designs should be made available to the consumer.

Video on the PC is not film. Film is a non-interactive, sequential, shared experience, while video on the PC is an interactive, randomly accessible, and typically unshared experience. Transplanting the experience of film onto the PC without significant change in design will not leverage the advantages of the PC.

The PC is a tool that can provide a different experience. PCs are good at providing random access, whether searching for medical information on the web or finding a recent email. The PC is meant to be interactive and responsive to requests from what is typically one user. Using the PC for video without interactivity would only make it an expensive VCR.

Video on the PC is a relatively new phenomenon, especially at the consumer level. The capture devices are only now becoming widely available and we can start experimenting. There are two questions these experiments can help answer:
1. What is the information to be organized?
2. Can the content benefit from an interactive presentation?
  • 2. In the rest of the introduction our attention will focus on the first question, while we will address the second one in the rest of the paper. So what kinds of video naturally qualify as a playground?

Movies and TV shows are probably not well suited to use on the PC because the content is strongly oriented towards a stream of shared experience. Jumping around inside “Saving Private Ryan” would probably not enhance the viewing experience. The PC simply does not have much to do other than playing the video. There might be exceptions such as skipping commercials or replaying sporting events. Jumping around the tape of a baseball game to find the action, repeating the important plays, or single stepping through the controversial plays, might significantly enhance the viewing experience.

A second choice might be educational materials. Imagine a web site with video showing origami (paper folding) for a variety of objects. You would select the swan or balloon that you would enjoy learning how to create and watch just that section. You would be able to refer to it again as you struggle with the different folds. However, there is not a lot of such material available. In addition, the integration of the content with the presentation is an important design element that might yield unique solutions useful only to, say, origami.

Both film and educational materials suffer from an additional conflict: ownership. Finding a novel presentation of copyrighted material could be enormously frustrating. A technical problem solved without a legal context is an invitation to litigation.

What content is available for the PC? Since about the mid-1980s, consumer video cameras have been widely available. Parents have bookshelves full of videotapes from nearly every one of their child’s events, from birthdays to sporting events. Unfortunately, the content of this bookshelf often goes unused. A tape labeled with the location of a recent vacation might also have the winning goal from a soccer game.
A skiing trip can be on the end of one tape and on the beginning of another. Worse, sorting and organizing tapes is almost never done, so even finding a tape from the right time period is difficult. Also, once the tape is found, finding the right place on the tape is troublesome, especially since there is no certainty about the contents.

Home video may not be the most exciting video content but it is a good place to experiment with what the PC can do. A solution to the problem of organizing home videos will provide a good metaphor for other categories such as organizing news footage for a local TV station or providing video reference material in a library.

Our approach treats video like a text document, providing functionality similar to a text editor. The rest of this paper presents in detail our design of the “video document” and several experiments we conducted to explore the variety of things that a PC can do with video.

2. TOOLS

Before discussing the experiments with a home video archive or any other application, a short summary of the enabling tools will serve to orient developers who follow. The target platform is Windows 98™ and Windows NT™, primarily to allow the software to be widely shared with colleagues. The programming tools were Microsoft Visual Basic™ for the user interface and a C++ compiler for the component libraries. The justification for using Visual Basic was to make it simple to try a design, throw it out and try again. The studied minimalism of code written in VB makes this possible.

Of all the components used, none is more important than the MPEG Processing Library (MPL), a general purpose and optimized MPEG processing software infrastructure developed in our lab. The library provides random access to any frame and was considerably faster than all commercially available libraries.
The MPL component also allowed the creation of necessary methods, properties, and events to make the experimentation easier.

Shot detection (discussed later) is a key component in the process of creating a visual overview of video on the PC. Shot detection is often the first step after capturing video because it provides the skeleton for subsequent views.

Common to almost all presentations is the use of a grid control that allows cells in the grid to contain pictures. VideoSoft’s VSFlex™ control is one such control that is commercially available. Other controls are available which can provide a similar service.

Databases are not more efficient than simple flat files. When used to manage video, the database need only contain a few numbers such as the starting frame number, ending frame number, and the representative key frame. However, by using
  • 3. database technology, the meaning of the individual fields is transparent and self-documenting. Anyone can understand your definitions using a variety of database programs. More important, adding new fields for different experiments, such as allowing the user to associate a description with any shot, is easily handled while maintaining backward compatibility. The specific database tool employed is Microsoft’s Data Access Objects™ (DAO). Microsoft’s Access™ is the tool used to verify the contents of the database as well as expose field names and definitions.

Installation and setup are important steps in the proliferation of any software tool. Developing the setup script while developing the software is a simple empirical method to ensure that software gets into the hands of real users. Great products have been trapped inside poorly managed setup scripts. A number of commercially available packages were evaluated and all suffered from the same deficiencies: unique scripting languages, quirky user interfaces, and a failure to make the installation process transparent. Writing the setup script in a widely available language does more than make the syntax familiar. It makes the process transparent and the complexities of installation in a Microsoft environment approachable. A simple bootstrap program ensures that all necessary components of the Visual Basic environment are present. The VB setup program then takes control, allowing a flexible user interface with the familiar wizard metaphor. Serious effort was put into making the install process as fast as possible while avoiding a reboot as often as possible.

3. VIDEO CAPTURE AND PLAYBACK

The purpose of this paper is to focus on what happens after the video has been captured. However, capture has to be done to make the content available as an MPEG-1/2 file on the PC. Once a video is captured it is desirable to always preserve the original.
In our scheme, changes to the content such as insertion, deletion, and transitions do not alter the original, only the way it is played. Alternative views or abstractions may be created in a future examination of the videos but the originals will remain intact. Some portion of the video that is not interesting now could prove invaluable later. Video is a lot for most PCs to digest at this point, and eliminating copies of even portions of a video is desirable and could be desirable even when it is no longer necessary.

Critical to this efficiency of keeping the original copy is seamless playback of edited video. Without seamless playback, viewer/editor disparities can emerge. If the playback of a video while editing were different from the playback when simply viewing the video, it would no longer be a WYSIWYG solution. MPL provides real-time access to individual I, P, and B frames by means of the creation of an index file.

4. VIDEO DOCUMENT

1. Shot View

The first step in building a video document on the PC is shot detection. A shot is a video sequence recorded by a camera’s uninterrupted operation. For instance, if the video contains a baby walking, then breaks, and then shows the baby sitting in a high chair eating, that break would define a shot boundary. The camera was stopped, pointed at something new, and started again.

In principle, shot boundaries can be detected automatically in two ways: either there is some metadata attached to the video that allows finding shot boundaries, or the shot boundaries must be deduced by analyzing the visual stream. In the case of digital video (DV) the cameras encode the date and time of recording with the video stream, which makes it possible for a computer to find any break in the time line easily. However, on conventional analog video cameras, this information either does not exist or is not externally available. If the video is MPEG encoded by an external device, any time information even in DV is gone.
Therefore, shot detection is performed on the visual stream.

Shot detection is a widely researched area [1-6]. We use the hard cut detection algorithm proposed in [5] and the fade detector in [4] to determine shots. Since both detectors work on only the DC coefficients of the MPEG video, our shot detector runs 19 times faster than real-time on a Pentium-II 450 MHz.

Once the shots have been detected, a key frame from the middle of the shot is used as a representative of the shot. Preparing a grid full of these key frames is a natural way to provide random access to any shot within a video and is called the Shot View (see Figure 1). Some natural ways of using a grid fall out from showing the Shot View. Clicking on a frame will open it in a full-size video player. Double-clicking any key frame will play the shot starting at that location.

Because the grid contains many small images, it is useful to put several players at the bottom of the screen. Clicking an image readies one of the players at that frame and enlarges the frame to full size. This allows closer inspection before playing. Additionally, similar images can be compared when both are enlarged in separate players.
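To make the Shot View pipeline concrete, the following is a minimal sketch of cutting a frame sequence into shots and picking the middle frame of each as its key frame. It is not the hard cut detector of [5] or the fade detector of [4]; it assumes a plain per-frame color histogram (a hypothetical input here, where the paper works on MPEG DC coefficients) and a single hand-chosen threshold, and all function names are illustrative.

```python
from typing import List, Sequence, Tuple


def histogram_difference(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute bin differences between two frame histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_hard_cuts(histograms: List[Sequence[float]], threshold: float) -> List[int]:
    """Return every frame index i at which a new shot starts.

    A cut is declared when the histogram difference between frame i-1
    and frame i exceeds the (assumed, hand-tuned) threshold.
    """
    cuts = []
    for i in range(1, len(histograms)):
        if histogram_difference(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts


def shots_with_key_frames(num_frames: int, cuts: List[int]) -> List[Tuple[int, int, int]]:
    """Turn cut positions into (start, end, key_frame) triples.

    The end frame is inclusive and the key frame is the middle frame of
    the shot, as in the Shot View described above.
    """
    starts = [0] + list(cuts)
    ends = [c - 1 for c in cuts] + [num_frames - 1]
    return [(s, e, (s + e) // 2) for s, e in zip(starts, ends)]
```

Each returned triple corresponds to one cell of the grid: the key frame supplies the picture, and the start frame is where playback begins on a double-click.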
  • 4. The size of the key frames in the grid is variable. Some videos will benefit from smaller images so that more can appear on the screen. Making the images as small as possible allows more on the screen and easier scanning and scrolling. If more detail is desired the key frames can be enlarged.

One of the key features of the grid format is that it does not matter whether you are viewing the video or editing it. The grid defines a mechanism for random access that is useful in both contexts. For purposes of product definition, it may be useful to have both an editor and a viewer, much like the Adobe Acrobat model. For Acrobat, the reader is freely distributed and the editor to create Acrobat documents is sold to a narrow market. What is shown in Figure 1 is the reader while Figure 2 shows the editor.

Figure 1: This shows the basic grid of key frames, highlighting the selected frame. In addition, the screen width permitted 2 player windows at the bottom. Click on an image to display it full-size. Double-click to play back the shot.

2. Shot Detection Problems

One of the first problems encountered when writing a shot detection component for home video is flash photography. Flashes are very common in home video and dramatically change the color of the entire contents of the shot. This, in general, will result in a shot boundary for most detection algorithms unless the software either restricts shot boundary detection to hard cuts and fades or explicitly searches for shot boundaries caused by flashes. In the first case, flashes are just not detected since they usually cause a double spike in the sequence of color histogram differences of video frames, thus
  • 5. violating the definition of hard cuts. On the other hand, a double spike is too short to be a fade. It is reasonable to assume that raw video footage does not contain other types of transitions. In the second case, it could be checked whether
• the sequence of color histogram differences across a shot boundary exhibits a double spike,
• a participating frame possesses a significantly higher average brightness than its two delineating frames, and
• the two delineating frames’ color contents are very similar.

Another problem with shot detection is the selection of the key frame. For our purpose it is tempting to have the first frame of any shot be the key frame to represent the shot. However, the first frame or even the first few frames are often transitions or fades or mangled by a poor quality video camera. A compromise was to display the middle frame as the representative of the shot. The interface allows the user to override the default to make any memorable frame the representative frame. Alternative algorithms to automatically select a key frame should be explored. One algorithm might be to take not the first frame but a frame some small fraction of the way from the beginning. A more expensive solution might be to average the color of all frames in a shot and find the frame closest to the average color.

Figure 2: A Time View showing frames at 1-second intervals. Rather than sequentially searching a video for the moment when the baby was whistling, the Time View allows you to find the frame at a glance. Also shown in this version is the Asset Manager (left side) to manage the list of videos on the machine.
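The three flash criteria and the "closest to the average color" key frame idea from this section can be sketched as follows. This is only an illustration under stated assumptions: the per-frame histograms, average brightness values, and average colors are hypothetical precomputed inputs, and the thresholds and the brightness ratio of 1.5 are placeholders, not values from the paper.

```python
from typing import List, Sequence, Tuple


def hist_diff(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute bin differences between two color histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def looks_like_flash(histograms: List[Sequence[float]], brightness: List[float],
                     i: int, spike_threshold: float, similar_threshold: float,
                     brightness_ratio: float = 1.5) -> bool:
    """Check the three criteria for a flash at frame i.

    1. Double spike: large difference both into and out of frame i.
    2. Frame i is much brighter than its two delineating frames.
    3. The two delineating frames have very similar color content.
    """
    if i <= 0 or i >= len(histograms) - 1:
        return False  # a flash frame needs a neighbor on both sides
    before, flash, after = histograms[i - 1], histograms[i], histograms[i + 1]
    double_spike = (hist_diff(before, flash) > spike_threshold
                    and hist_diff(flash, after) > spike_threshold)
    brighter = brightness[i] > brightness_ratio * max(brightness[i - 1], brightness[i + 1])
    neighbours_similar = hist_diff(before, after) < similar_threshold
    return double_spike and brighter and neighbours_similar


def key_frame_nearest_mean(colors: List[Tuple[float, float, float]],
                           start: int, end: int) -> int:
    """Return the frame in [start, end] whose average color is closest
    (squared Euclidean distance) to the mean color of the whole shot."""
    frames = colors[start:end + 1]
    n = len(frames)
    mean = tuple(sum(c[k] for c in frames) / n for k in range(3))
    distances = [sum((c[k] - mean[k]) ** 2 for k in range(3)) for c in frames]
    return start + distances.index(min(distances))
```

A detector could call `looks_like_flash` on each candidate boundary and discard the boundary when it returns true, so that a flash does not split a shot in two.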
  • 6. 3. Searching the Video

The conventional metaphor for playing video is a series of play, pause, and stop buttons seen on most consumer audio devices. Beyond that conventional metaphor, Figure 1 shows some buttons that allow the user to jump to the previous shot, previous frame, next frame and next shot. The ability to play the video in a fast backward and fast forward fashion allows the user to search a sequence.

Searching backward and forward is a common means of looking at videotape and is present on many consumer devices. Using fast forward to find a specific frame on the PC is quite common. However, searching backward is not present in any commercially available software player or editor. Our fast backward capability, which is enabled by MPL’s support for backward frame access in MPEG video, fills in a hole left in most mechanisms to search video on the PC.

4. Time View

A Time View (Figure 2) is a series of frames selected at a specific time interval, e.g. one frame for every two seconds. The Time View can provide a simpler way to search than conventional playback because many frames can be viewed at once. The mosaic of frames has all the same ability to click, double-click, or fast forward. More precision can be obtained by selecting a shot or a series of shots and then opening a Zoom View that is simply a Time View for the selected time. A Zoom View can be thought of as a Time View with finer granularity.

5. Asset Management

Video files are large and keeping track of them is important as they represent a high percentage of the disk usage. Asset management is an attempt to collect a meaningful view of all the video available to the system. There are two parts to asset management, the Asset Manager and the Asset Summary (see Figure 3). The Asset Manager is always visible on the left side of the screen. Each asset in the view shows the principal name of the video.
This tree view is used to select which video to work on or which view of the video is to be opened.

The Asset Summary is a grid containing all video content on the PC. It can be sorted in a variety of ways by clicking on the title of the respective column. The Asset Summary can be used to find a video from a particular date or with a specific description or category.

It is difficult to predict what variables will be important to keep in the database but month and year are obvious candidates, although they are often found in the title as well. Putting the month and year in the grid allows the videos to be sorted by date and this can be helpful. Description is potentially useful and should be there. Category is an attempt to think ahead to the time when there are so many videos that there will be a need for further categories.

Any field in the grid can be edited just like a spreadsheet and a number of them have pulldown menus to make the edits easier.

Figure 3: The Asset Manager along the left side is used to select videos to work on or to open a view of a particular video. The Asset Summary in the work area of the form is an editable grid displaying all the information in the database about each video.
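The Time View and Zoom View described in the Time View subsection both reduce to picking frame numbers at a fixed time step, over the whole video or over a selected range. A minimal sketch, assuming a constant frame rate and using a hypothetical function name:

```python
from typing import List, Optional


def time_view_frames(total_frames: int, fps: float, interval_seconds: float,
                     start_frame: int = 0,
                     end_frame: Optional[int] = None) -> List[int]:
    """Frame numbers to show in a Time View.

    With the default range this is the Time View of the whole video.
    Passing the first and last frame of a selection, together with a
    smaller interval, gives the finer-grained Zoom View.
    """
    if end_frame is None:
        end_frame = total_frames - 1
    # Never step by less than one frame, even for very small intervals.
    step = max(1, round(fps * interval_seconds))
    return list(range(start_frame, end_frame + 1, step))
```

For example, a 100-frame clip at 30 frames per second sampled once per second yields frames 0, 30, 60, and 90; zooming into frames 30 to 60 at half-second intervals yields frames 30, 45, and 60.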
  • 7. 6. Edit, Cut, Copy, and Paste

Accessing any frame in a video can be more useful if that frame can be pulled out of the video and copied elsewhere. The image can be placed in an email or posted to the web. Frame-accurate control is possible using the frame selection controls (the plus (+) and minus (-) signs in the toolbar). The image is copied to the clipboard in either JPEG or Windows Bitmap format.

The Copy function was the first and simplest to implement but the comparison with a word processor only began there. The Copy function simply moves an image and its associated temporal extent into the system clipboard for a later paste. Selecting a frame and then specifying the Cut function removes that shot or frame sequence from the view. If more than one frame is selected, then more than one is cut from the document.

Selecting more than one key frame in the grid and using the Copy function makes that series of frames available to a Paste function. That series of shots can then be placed into any location in the video document. Anyone familiar with a word processor or a spreadsheet will have no trouble with the concepts. Figure 4 shows the three steps of highlighting a sequence of 5 shots from a video sequence, cutting out the segment and finally pasting the five shots at the end of the sequence. This sequence of highlight-cut-paste on shots is exactly analogous to the corresponding operations on text. Figure 5 illustrates the method employed by current video editing software: a timeline is used for the creation of a new video sequence. Segments of video are selected and dropped onto the timeline. Our video document approach allows editing to be made on the existing view of the video in the same fashion as text editing is performed, whereas the present video-editing paradigm requires the creation of a new sequence based on segments of existing ones.

The edit functions also work when using more than one video document.
For instance, a series of frames in one document may be copied to the clipboard and then pasted onto another video document. When the combined video document is played, the video player will seamlessly switch between videos during playback. To anyone watching the video, it will appear that the two videos are part of a single MPEG file.

When the combined video document is saved, it is not saved as a single MPEG file. Instead, the database for the video asset is updated to reflect the new series of shots with their respective video file names. The typical table in the database for such a video document is small, usually less than 10 kilobytes.

Figure 4: The steps of highlighting a sequence of shots, cutting the highlighted segments and pasting the segments at the end of the sequences. Editing is done directly on the current sequence.
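The non-destructive editing model described above, in which a video document is only a small table of shots pointing into untouched MPEG files, can be sketched with an in-memory list. This is an illustration, not the paper's DAO schema: the `Shot` fields mirror the starting frame, ending frame, and key frame mentioned in the Tools section, while the field names and helper functions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Shot:
    """One row of a video document: a frame range inside a source file."""
    video_file: str
    start_frame: int
    end_frame: int
    key_frame: int


def cut(document: List[Shot], first: int, last: int) -> Tuple[List[Shot], List[Shot]]:
    """Remove shots first..last (inclusive) and return (remaining, clipboard).

    The input list and the source video files are left untouched.
    """
    clipboard = document[first:last + 1]
    remaining = document[:first] + document[last + 1:]
    return remaining, clipboard


def paste(document: List[Shot], clipboard: List[Shot], position: int) -> List[Shot]:
    """Insert the clipboard shots before the given position."""
    return document[:position] + clipboard + document[position:]


def playback_plan(document: List[Shot]) -> List[Tuple[str, int, int]]:
    """The (file, start, end) ranges a player would render back to back."""
    return [(s.video_file, s.start_frame, s.end_frame) for s in document]
```

Because shots carry their own file name, pasting shots from one document into another needs no special handling: the playback plan simply lists ranges from more than one file, and saving the document means storing this short list rather than a new MPEG file.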
  • 8. Figure 5: The current video editing approach: creating a new video sequence on a timeline view. Editing is performed by the creation of a new sequence.

7. MPEG Native Editor

Playing an edited video with the MPL library always uses the original video content for playback and manipulates it on the fly (it is fast enough to produce a seamless video stream). However, there are a variety of other hardware and software decoders and other DirectShow applications that the consumer might want to use. Our video editor needs to be able to export MPEG files to these other applications.

The MPEG native editor is provided for that purpose. The MPEG native editor is built with a precise knowledge of the MPEG standard and maintains the bit rate of the original and the series of timecodes in the MPEG stream. There are plenty of video editors available that handle MPEG editing; however, nearly all of them require decompression and recompression of the whole edited video. In contrast, the MPEG native editor just copies those GOPs that are untouched by the editing. Only new shot boundaries need decompression and recompression of the affected GOPs. This reduces the time of computation significantly.

8. Anti-Jitter Playback

Working with a variety of home videos means encountering some pretty poor hand-held camera work. Rapid pans, rapid zooms and an inability to hold the camera steady make viewing the video difficult if not nauseating. Many of the modern cameras have image stabilization features that reduce jitter considerably. However, rapid pans and zooms cannot be eliminated even with image stabilization. In addition, many videos have already been created without image stabilization. Therefore, an anti-jitter playback feature has been added to the software specifically to handle these problems.

Anti-jitter playback works by giving the user control of the time between frames. One way to interpret it is to say that it converts the video to a slide show with the original audio playing.
Allowing the user to focus on an image that is on the screen for one or more seconds reduces the nausea. If the jitter is low, the images may be updated more frequently. If it is high, the image may be held for several seconds.

The resources required for anti-jitter playback are less than for normal playback because far fewer frames are actually displayed. By doing less work, otherwise unviewable video can be seen and heard. For the future, we plan to fight jitter in the narrow sense (i.e., no rapid pans and zooms) also by cropping multiple frames in time so as to keep the “image” stable.

9. CD-ROM Image

Documents are meant to be shared. The same is true for video documents. Whatever you have created (the different edited versions of a video, its video abstracts and meta-information), you should be able to pass it on to other people, so that you can share what you have created. Therefore, we provide the capability to burn CD-ROMs that include the MPEG file, the different views, the meta-information as well as self-installing playback software. This playback software is strictly a viewer that will present a grid of video (based on the results of the shot detection) and allow random access. It does not include the edit, cut, and paste features or the Asset Manager. Once you have inserted the video CD-ROM into your computer it will start the viewer application automatically.
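Returning to the MPEG Native Editor subsection, the idea of copying untouched GOPs and recompressing only those cut by a new shot boundary can be illustrated with a small planning function. This is a deliberately simplified sketch, not the authors' editor: it assumes closed GOPs whose first frame numbers are already known (for instance from an index file), ignores frame dependencies across GOP boundaries, and uses hypothetical names throughout.

```python
from typing import List, Tuple


def plan_gop_export(gop_starts: List[int], total_frames: int,
                    segments: List[Tuple[int, int]]) -> List[Tuple[int, str]]:
    """Decide, for each kept segment, which GOPs to copy and which to re-encode.

    gop_starts: sorted first-frame numbers of every GOP in the source file.
    segments:   inclusive (start, end) frame ranges kept in the edited video.
    Returns (gop_index, action) pairs in output order, where action is
    "copy" for a GOP lying entirely inside a segment and "reencode" for
    a GOP that a segment boundary cuts through.
    """
    gop_ends = [s - 1 for s in gop_starts[1:]] + [total_frames - 1]
    plan = []
    for seg_start, seg_end in segments:
        for index, (g_start, g_end) in enumerate(zip(gop_starts, gop_ends)):
            if g_end < seg_start or g_start > seg_end:
                continue  # this GOP does not overlap the kept segment
            whole = seg_start <= g_start and g_end <= seg_end
            plan.append((index, "copy" if whole else "reencode"))
    return plan
```

With 15-frame GOPs and a kept range of frames 10 to 44, only the first overlapping GOP is marked for re-encoding and the two GOPs that follow are copied unchanged, which is where the time saving comes from.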
  • 9. One positive side effect of creating CD-ROM images of video documents is that it provides an easy and cheap backup solution for the large MPEG files.

10. Date Capture

It would be nice if the date of every video were captured along with the video. Digital video recorders have the date encoded within the video stream but once the video is captured (usually in an external capture device), there is no provision to carry the date in the MPEG format. The simple solution is to force the user to provide month and year for every video that is added to the video asset.

There is another solution but it requires a great deal more processing to obtain (about 10 minutes for a 1-hour MPEG-1 video on a Pentium II 450 MHz). Most home video cameras have a provision to place a month, day, and year on the video as it is being captured. This date can be obtained automatically by some unconventional Optical Character Recognition (OCR) techniques [7-12].

Conventional OCR would take a single frame and try to isolate the characters by finding the edges, performing some horizontal and vertical transformation to create projections that are looked up in a table. However, OCR on a video frame is not the simple black and white problem that appears when working on the printed page. The letters are typically white or light gray but the area behind the letters may be snow, making the letters virtually invisible.

The technique used here is a simplified version of the one found in [11]. It basically takes advantage of the fact that the date’s location is fixed over time and that the date’s color is very bright. For every shot the frames are stacked one over each other and the minimum pixel intensity is calculated over time for each pixel position. As a result, only date text pixels will keep their brightness and can therefore be extracted by simply thresholding the combined image and removing small regions not meeting the geometric requirements of characters.
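The stacking step can be sketched in a few lines. The sketch below covers only the per-pixel minimum over time and the thresholding; the removal of small regions, the upscaling, and the character recognition itself are omitted. Frames are assumed to be equally sized grayscale images given as nested lists, and the threshold level is a placeholder rather than a value from the paper.

```python
from typing import List

Image = List[List[int]]


def minimum_over_time(frames: List[Image]) -> Image:
    """Per-pixel minimum intensity across all frames of one shot.

    A pixel stays bright in the result only if it was bright in every
    frame, which is true for an overlaid date but rarely for the scene.
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [[min(frame[y][x] for frame in frames) for x in range(width)]
            for y in range(height)]


def threshold(image: Image, level: int) -> Image:
    """Binary mask: 1 where the intensity is at least `level`, else 0."""
    return [[1 if value >= level else 0 for value in row] for row in image]
```

Applying `threshold(minimum_over_time(frames), level)` to the frames of a shot gives a mask in which the candidate date pixels are 1; bright background such as snow drops out as soon as any frame of the shot is darker at that position.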
In order to improve the quality of the extracted date character bitmaps, all images are scaled up by a factor of four using cubic interpolation before applying any operations. Next, we utilize the fact that the date text has a very simple structure such as “MM DD YYYY” or “HH:MM:SS” in order to correct falsely recognized characters, fill in missed characters and judge whether the recognition is correct or needs manual revision. On 5.5 hours (676 shots) of home video from analog tapes, 96% of all dates were identified and judged correct, 3.1% were marked as needing manual resolution, and 0.9% were recognized incorrectly.

5. VIDEO ABSTRACTS

Recording home videos with camcorders is much more popular than playing them back. This difference is due to the fact that unedited home video footage is usually long-winded, lacks visually appealing effects, and thus tends to be too time-consuming and boring to watch. However, most people do not have time to edit their videos, and even if they did, the resulting video is too inflexible and cannot adjust to the viewers’ needs. For instance, if some friends visit, it is very likely that you only want to show them a 15-minute excerpt of your last vacation in order not to bore your guests, while when your parents visit, a longer excerpt could be tolerable. This cannot be done easily with current systems. However, a system capable of abstracting raw video into shorter video automatically in real-time could give a user that flexibility. It would easily generate video abstracts satisfying the individual time constraints.

There are other papers written to describe how to create a video abstract from a home video [13]. The importance here is that a short abstract can be created within a few seconds for any number of videos. There are two types of video abstracts created with our software. Both methods require that each shot be cut down in a preprocessing step to a short clip of ten seconds at most showing only the most important part.
Importance is here defined as the part of the shot that has the largest audio support [13]. Once the shots have been shortened, two different kinds of abstracts can be created.

The first abstract is a simple Random Abstract that chooses clips randomly until the target abstract length is reached. Each time a new abstract is created it will be different. The length of the abstract is specified along with the list of videos to abstract. A simple request might select five minutes from three videos. The results are presented in a grid which can be played or replayed any number of times.

The second type of abstract is called Smart Abstract. The difference is that the date of recording (see Section 4) is used to cluster shots into a hierarchy of shot clusters of weeks and days. This hierarchy of shot clusters is then used to create significantly better abstracts compared to the random abstracts. A Smart Abstract is driven by rules that implement the following objectives in order to create good-quality abstracts:
  • 10.
• Balanced Coverage. The video abstract should be composed of clips from all parts of the source video set.
• Shortened Shots. Commonly, the duration of the raw, unedited shots is too long and the content too long-winded for video abstracts. Moreover, their uncut presence in video abstracts does not offer a balanced coverage of the source video material. Therefore, shots exceeding a maximum length must be cut down to their most interesting parts.
• Random Selection. Due to the nature of home video material, all shots are generally more or less equally important. In addition, individual abstracts of the same source videos should vary each time in order to hold interest after multiple playbacks. Therefore, “controlled” random clip selection should be a key feature of our video abstracting procedure.
• Focused Selection. If the abstraction ratio is high, as is commonly the case, the abstracting algorithm should focus only on a random subset of week clusters and the corresponding day clusters. Thus, a more detailed coverage of selected clusters is preferred over a totally balanced, but more superficial coverage.

Currently, there are no attempts to provide visual cues to the transitions between shots in the abstract but it would be highly desirable. Showing a simple transition such as a door opening or a horizontal wipe gives the viewer a sense that there is a break in the time.

It may be argued that creating a video abstract may be obsolete when scanning video is so easy using a shot view. Jumping from frame to frame in a grid and playing small portions is the reason for the grid view. However, if creating a short video to share with others who are passively watching, the abstract is demonstrably useful.

6. MARKET COMPARISONS

The market for video capture devices is growing rapidly as more consumers discover utility in capturing video on the PC. The video software market will expand with it because it is tied to the market for the hardware.
The software for video browsing, video editing, and video managing needs to become independent of the hardware just like the market for word processors is completely independent of the platform. It is a question of providing an abstraction of what consumers like to do with video. The video document is an instantiation of such an abstraction.

A market survey of existing products would be quickly out of date. However, some general statements are possible. Almost all the products suffer from one affliction or another but, more importantly, some of them exhibit a few of the desirable traits outlined in this paper.

Some of the current software products suffer from problems that can be seen in Figure 6. No one can complain about an interface that looks like a TV from the Jetson’s cartoon show, but the failure to resize to the larger screen is unproductive. Specialized cursors, filmstrips instead of grids, and modal interfaces are common mistakes found in some of the current crop of products. The filmstrip paradigm is unproductive for most consumers because the filmstrip does not effectively use the available screen space. It is inherently one-dimensional. Doubling the screen size only results in a filmstrip twice as long, while the Grid View will be able to show four times as much content as before.

Figure 6: The typical video capture/edit process does not make good use of the space on a 1600x1200 screen. Video Assets are managed on the right with only 8 characters for the name. Unique cursors come and go depending on both the mode and where the mouse hovers. The ubiquitous filmstrip appears at the bottom.

However, despite these faults, there is some evidence that features present in our software are showing up in commercially available products, but each of those features is crippled by the presence of the filmstrip legacy. Using these software products is like reading a book without a good editor.
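The scaling argument for the Grid View can be checked with a small calculation. The sketch below is only an illustration: the function names are hypothetical, thumbnails are assumed to have a fixed size (160x120 pixels, a value chosen for the example), and "doubling the screen size" is taken to mean doubling both width and height.

```python
def filmstrip_capacity(screen_w, screen_h, thumb_w=160, thumb_h=120):
    """Thumbnails that fit in one horizontal strip across the screen."""
    if screen_h < thumb_h:
        return 0
    return screen_w // thumb_w


def grid_capacity(screen_w, screen_h, thumb_w=160, thumb_h=120):
    """Thumbnails that fit in a grid filling the whole screen."""
    return (screen_w // thumb_w) * (screen_h // thumb_h)


# Going from 800x600 to 1600x1200 doubles both dimensions.
print(filmstrip_capacity(800, 600), filmstrip_capacity(1600, 1200))  # 5 10
print(grid_capacity(800, 600), grid_capacity(1600, 1200))            # 25 100
```

The filmstrip grows linearly with the screen width, while the grid grows with the screen area, which is why the larger screen benefits the grid four times over but the filmstrip only twice.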
For instance, there are products that employ shot detection to create a small grid of video, but they do not work on MPEG files. There are MPEG editors, but they have no shot detection, so manual intervention is required to define boundaries. There are equivalents to the Asset Manager, but they have no sort capability. There are ways to create MPEG files from multiple originals, but they are slow (since they need decompression and recompression), hard to learn, and often modal.

Nowhere is the design as complete as the one proposed here, and even if our software is unsuccessful, it may help to define what good video software should do. Most of the software available comes packaged with a hardware solution that incorporates features of the capture device into the interface. This is like selling a keyboard with a word processor that only works with that keyboard. This paper is an attempt to create an abstraction of what this new medium of video on the PC requires.

7. INTERNET VIDEO

The key problem with video on the Internet is bandwidth. In order to transmit video, a streaming server first sends a number of buffers to the client, which plays them while more buffers are downloaded. This prevents running out of buffers and stalling the video. The time consumed waiting to start a video may be 30 seconds or more before the first frame of video is seen.

Worse, there is no way for the viewer to go faster or jump around in the stream because the only buffers queued up are those that would be used sequentially. Once consumed, the buffers are tossed. If a second viewing of the video is requested, the streaming server will queue the buffers and transmit the video again from the beginning with the same delays as before.

One alternative is a grid view full of key frames in an HTML page (Figure 7). This would allow the user to select where to begin viewing. The user has still frames in the grid to look at while waiting to download buffers. In addition, the grid provides a visual cue that shows what has just been streamed to the client PC. Once viewed, the user can select only that portion of the video that they want to see again.

Figure 7: A grid of pictures in a web page that allows selecting where to begin the video.

The grid interface does not solve all the problems of streaming video.
First, the video must still be buffered, so there are delays no matter where viewing starts. In addition, unless the web page displays some additional text or graphics, a single frame may not be sufficient help to the viewer trying to determine where to start the video.

Nonetheless, streaming video from web servers is on the rise (typically movie trailers), and a grid view in a web page would give the user more control over what they see while at the same time reducing the bandwidth demands on the network.

8. CONCLUSION

What conclusions can we draw now that we have experimented with this software for most of a year? First, it is clear that some of these attempts to create new formats for the presentation and use of video on the PC should receive further experimentation, preferably in the marketplace or in some format that enlists consumer input. The market potential for a successful tool is enormous because so many people have PCs and so many will have video on those PCs in the coming years. Video on the PC is a subject that has universal appeal.

The behavior of real people using video on the PC is a sociological question that properly should be answered by studies and high-volume market surveys. It is possible for us to present data describing how many individuals preferred one format or feature to another, but engineers do not make the best marketing analysts. Such a market survey is beyond the scope of this paper, which is intended only to present the alternatives.

A better methodology is available to each of us through introspection and experimentation. This has been the approach used in this paper: to design a tool that is intended for use at home with our own home videos. The feedback loop is similar to that of a tinkerer who will keep trying until it is right.

Even without any consumer surveys or experimental data, some conclusions stand out as potentially valuable contributions to the use of video on the PC.
The first conclusion is that shot detection is vital to whatever process follows. The grid of key frames provides a scaffold to support other structures that may be built. A second conclusion is that the grid view, with a healthy respect for the efficient use of screen space, is an efficient way to present video. It provides the mechanism for random access as well as an efficient way to sense what is in the video content.

It remains to be seen if building on the metaphor of word processing is useful to consumers. Will edit, cut, and paste be important interactions with video? Will consumers just want to watch the video or will random access matter? The behavior of frustrated channel surfers is an indication that more interactions with video content are desirable and should be pursued.

Regardless of what happens, it is important to jettison the metaphor of film and treat the PC as a new kind of medium for video. The video document concept represents a historically unencumbered way of pursuing the future. The features of a video document that are most likely to be embraced are:
• Shot detection provides a simple way to organize video content.
• A grid presentation of key frames is an efficient overview and creates the mechanism for random access.
• The word processing metaphors of edit, cut, and paste build on existing, well-understood uses of the PC.
• MPEG files are the medium for exchange to save disk space.
• Seamless playback while editing is an essential WYSIWYG feature.
• The ability to create new MPEG files from edited views is essential for sharing to other venues.
• Video content needs some form of asset management, preferably sorted in a variety of ways.
• Individual frames can be searched, copied, and pasted elsewhere.
• A self-installing playback tool should be provided for CD-ROM images for sharing to other PCs.
• The date on the video should be automatically captured, either via OCR or directly from the DV.

The above features are viewed as essential to enhancing the use of video on the PC.
The following features of the video document are less convincing and need further exploration and experimentation:
• Some methods for creating an abstract can be useful for exporting to other playback environments.
• The anti-jitter playback of poor quality video is desirable but a more satisfying result should still be attempted.

Some other areas for further work include:
• We completely neglected titling in our implementations. This was a practical choice guided by the demands of research. It is reasonable to expect that users will want to place titles in their videos.
• Providing some transitions and special effects would mark time changes or editing cuts. Abstracts would benefit from the visual cue of a transition effect.

9. REFERENCES

1. J. S. Boreczky and L. A. Rowe. Comparison of Video Shot Boundary Detection Techniques. In Storage and Retrieval for Still Image and Video Databases IV, Proc. SPIE 2664, pp. 170-179, Jan. 1996.
2. A. Dailianas, R. B. Allen, and P. England. Comparison of Automatic Video Segmentation Algorithms. In Integration Issues in Large Commercial Media Delivery Systems, Proc. SPIE 2615, pp. 2-16, Oct. 1995.
3. A. Hampapur, R. C. Jain, and T. Weymouth. Production Model Based Digital Video Segmentation. Multimedia Tools and Applications, Vol. 1, No. 1, pp. 9-46, Mar. 1995.
4. R. Lienhart. Comparison of Automatic Shot Boundary Detection Algorithms. In Storage and Retrieval for Image and Video Databases VII, SPIE Vol. 3656, pp. 290-301, Jan. 1999.
5. B.-L. Yeo and B. Liu. Rapid Scene Analysis on Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 6, pp. 533-544, December 1995.
6. R. Zabih, J. Miller, and K. Mai. A Feature-Based Algorithm for Detecting and Classifying Scene Breaks. Proceedings ACM Multimedia 95, San Francisco, CA, pp. 189-200, Nov. 1995.
7. H. Li, D. Doermann, and O. Kia. Automatic Text Detection and Tracking in Digital Video. IEEE Trans. on Image Processing. To appear.
8. R. Lienhart. Automatic Text Recognition for Video Indexing. In Proceedings of the ACM Multimedia ’96 (Boston, MA, USA, November 11-18, 1996), pp. 11-20, November 1996.
9. R. Lienhart and W. Effelsberg. Automatic Text Segmentation and Text Recognition for Video Indexing. Technical Report TR-98-009, Praktische Informatik IV, University of Mannheim, May 1998. To appear in ACM/Springer Multimedia Systems Magazine.
10. A. K. Jain and S. Bhattacharjee. Text Segmentation Using Gabor Filters for Automatic Document Processing. Machine Vision and Applications, Vol. 5, No. 3, pp. 169-184, 1992.
11. T. Sato, T. Kanade, E. Hughes, and M. Smith. Video OCR for Digital News Archives. IEEE Workshop on Content-Based Access of Image and Video Databases (CAIVD98), Bombay, India, January 1998.
12. V. Wu, R. Manmatha, and E. M. Riseman. Finding Text in Images. In Proceedings of Second ACM International Conference on Digital Libraries, Philadelphia, PA, pp. 23-26, July 1997.
13. R. Lienhart and B.-L. Yeo. Automatic Abstraction of Home Video Footage into Shorter Video. Submitted to IEEE Transactions on Multimedia.