White Paper: MPEG-4 Toolkit Approach



SCALABLE MEDIA PERSONALIZATION

Amos Kohn
September 2007

ABSTRACT

User expectations, competition and sheer revenue pressure are driving rapid development, and operator acquisition, of highly complex media processing technologies.

Historically, cable operators provided "one stream for all" service in both the analog and digital domains. At most, they provided two to three streams for East and West Coast delivery. Video on Demand (VOD) represented a first step toward personalization, using personalized delivery, in the form of "pumping" and network QAM routing, in lieu of personalization of the media playout itself. In some cases, personalized advertisement play-lists were also created. This resulted in massive deployments of VOD servers and edge QAMs.

The second step in this evolution is the introduction of switched digital video, which takes linear delivery one step further to provide a hybrid VOD/linear experience without applying any personal media processing. Like previous personalization approaches, user-based processing is limited to network pumping and routing, with no access to the actual media or ability to manipulate it for true personalization.

True user personalization requires the generic ability to perform intensive media processing on a per-user basis. As of today, an STB-based approach to media personalization seems to be dominant. This approach necessitates future deployment of more capable (thus more expensive) STBs. Although straightforward, it is incompatible with the need to lower costs, unify the user experience, retain customers and meet other operator needs. The network approach, where per-user personalization is completely or partially accomplished BEFORE the video reaches the STB (or any other user device), delivers the same experience but has been explored only in a very limited fashion. Yet this approach has the most potential to benefit operators, as it addresses most of the current and future challenges that operators face.
NETWORK-BASED PROCESSING TOOLKIT

The following defines a set of coding properties that are used as part of the media personalization solution. As indicated below, one of the advantages of this solution is that it is standards-based, as are the tools. The properties defined here are a combination of MPEG-4 (mostly H.264) and MPEG-2; the combination provides a solution for both coding schemes.

MPEG-4 is composed of a collection of "tools" built to support and enhance scalable composition applications. Among the tools discussed here are shape coding, motion estimation and compensation, texture coding, error resilience, sprite coding and scalability.

Unlike MPEG-4, MPEG-2 provides a very limited set of functionality for scalable personalization. The tools defined in this document are nevertheless sufficient to provide personalization in the MPEG-2 domain.

Object-Based Structure and Syntax

Content-based interactivity: the MPEG-4 standard extends traditional frame-based processing toward the composition of several video objects superimposed on a background image. For proper rendering of the scene without disturbing artifacts at the borders of video objects (VOs), the compressed stream contains the encoded shape of each VO. Representing video as objects, rather than as video frames, enables content-based applications. This, in turn, provides new levels of content interactivity based on efficient representation of objects, object manipulation, bitstream editing and object-based scalability.

An MPEG-4 visual scene may consist of one or more video objects. Each video object is characterized by temporal and spatial information in the form of shape, motion and texture. The visual bitstream provides a hierarchical description of a visual scene. Start codes, which are special code values, provide access to each level of the hierarchy in the bitstream. The ability to process objects, layers and sequences selectively is a significant enabler for scalable personalization.
Hierarchical levels include:

  • Visual Object Sequence (VS): An MPEG-4 scene may include any 2-D or 3-D natural or synthetic objects. These objects and sequences can be addressed individually based on the targeted user.

  • Video Object (VO): A video object is linked to a certain 2-D element in the scene. A rectangular frame provides the simplest example, or it can be an arbitrarily shaped object that corresponds to an object or background of the scene.

  • Video Object Layer (VOL): Video object encoding takes place in one of two modes, scalable or non-scalable, depending on the application, represented in the video object layer (VOL). The VOL provides support for scalable coding.

  • Group of Video Object Planes (GOV): Optional in nature, GOVs enable random access into the bitstream by providing points where video object planes are independently encoded.

MPEG-4 video thus consists of various video objects, rather than frames, allowing true interactivity and manipulation of separate, arbitrarily shaped objects with an efficient scheduling scheme to speed up real-time computation.
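The hierarchy above can be sketched as a set of nested data structures. This is an illustrative model only, not an actual MPEG-4 parser; all class and field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoObjectPlane:      # VOP: a video object sampled at one time instant
    timestamp: float
    coding_type: str         # "I", "P" or "B"

@dataclass
class GroupOfVOPs:           # GOV: optional random-access point
    vops: List[VideoObjectPlane] = field(default_factory=list)

@dataclass
class VideoObjectLayer:      # VOL: scalable or non-scalable encoding of a VO
    scalable: bool
    govs: List[GroupOfVOPs] = field(default_factory=list)

@dataclass
class VideoObject:           # VO: one 2-D element of the scene
    shape: str               # "rectangular" or "arbitrary"
    layers: List[VideoObjectLayer] = field(default_factory=list)

@dataclass
class VisualObjectSequence:  # VS: the complete visual scene
    objects: List[VideoObject] = field(default_factory=list)

# A minimal scene: one rectangular object, one scalable layer, one I-VOP.
scene = VisualObjectSequence(objects=[
    VideoObject(shape="rectangular", layers=[
        VideoObjectLayer(scalable=True, govs=[
            GroupOfVOPs(vops=[VideoObjectPlane(timestamp=0.0, coding_type="I")])
        ])
    ])
])
```

Because each level is addressable on its own, a personalization engine can select or replace a single VO or VOL for a targeted user without touching the rest of the scene.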
  • Video Object Plane (VOP): VOPs are video objects sampled in time. They can be sampled either independently or dependently, using motion compensation. Rectangular shapes can represent a conventional video frame. A motion estimation and compensation technique is provided for interlaced digital video such as video object planes (VOPs). Predictor motion vectors, for use in differentially encoding a current field-coded macroblock, are obtained using the median of the motion vectors of surrounding blocks or macroblocks, which supports high system scalability.

Figure 1 below illustrates an object-based visual bitstream. A visual elementary stream compresses the visual data of just one layer of one visual object. There is only one elementary stream (ES) per visual bitstream. Visual configuration information includes the visual object sequence (VOS), visual object (VO) and visual object layer (VOL). Visual configuration information must be associated with each ES.

Figure 1: The visual bitstream format

COMPRESSION TOOLS

Intra-coded VOPs (I-VOPs): VOPs that are coded using information within the VOP itself, removing some of the spatial redundancy. Inter coding makes use of temporal redundancies between frames through motion estimation and compensation; two modes of inter coding are provided: prediction based on a previous VOP (P-VOPs) and prediction based on both a previous VOP and a future VOP (B-VOPs). These tools are used in the content preparation stage to increase compression efficiency and error resilience, and to code different types of video objects.

Shape coding tools: MPEG-4 provides tools for encoding arbitrarily shaped objects. Binary shape information defines which portions (pixels) of the image belong to the video object at a given time, and is encoded by a motion-compensated, block-based technique that allows both lossless and lossy coding. The technique allows for accurate representation of objects, which in turn improves the quality of the final composition and assists in differentiating between video and non-video objects within the stream.

Sprite coding: A sprite is an image composed of pixels belonging to a video object visible throughout a video sequence, and is an efficient and concise method for representing a background video object, which is typically compressed with the object-based coding technique. Sprites achieve high compression efficiency when the whole background of a video frame is visible at least once over the video sequence.

MPEG-4 H.264/AVC Scalable Video Coding (SVC): A method of achieving high video compression efficiency is the scalable extension of H.264/AVC, known as scalable video coding or SVC. A scalable video bitstream contains a non-scalable base layer and one or more enhancement layers. (The term "layer" in Video Coding Layer (VCL) refers to syntax layers such as block, macroblock and slice.) The basic SVC design can be classified as a layered video codec. In general, the coder structure, as well as the coding efficiency, depends on the scalability space required by an application. An enhancement layer may enhance the temporal resolution (i.e., the frame rate), the spatial resolution, or the quality of the video content represented by the lower layer or part of it. The scalable layers can be aggregated into a single transport stream or transported independently.

Scalability is provided at the bitstream level, allowing for reduced complexity. Reduced spatial and/or temporal resolution can be obtained by discarding from a global SVC bitstream the NAL units (or network packets) that are not required for decoding the target resolution. NAL units contain motion information and texture data. NAL units of Progressive Refinement (PR) slices can additionally be truncated in order to further reduce the bit rate and the associated reconstruction quality.
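The bitstream-level extraction described above can be illustrated by filtering NAL units on their layer identifiers. This is a simplified sketch: a real SVC extractor parses the `dependency_id`, `temporal_id` and `quality_id` fields from the NAL unit header extension, whereas here each unit is modeled as a plain dict:

```python
def extract_substream(nal_units, max_spatial, max_temporal, max_quality):
    """Keep only the NAL units needed to decode the target operating point.

    Units whose layer identifiers exceed the requested spatial, temporal
    or quality level are discarded, shrinking the bitstream without
    re-encoding anything.
    """
    return [n for n in nal_units
            if n["dependency_id"] <= max_spatial
            and n["temporal_id"] <= max_temporal
            and n["quality_id"] <= max_quality]

# A base layer plus two enhancement layers:
stream = [
    {"dependency_id": 0, "temporal_id": 0, "quality_id": 0},  # base layer
    {"dependency_id": 0, "temporal_id": 1, "quality_id": 0},  # higher frame rate
    {"dependency_id": 1, "temporal_id": 0, "quality_id": 0},  # higher resolution
]
base_only = extract_substream(stream, 0, 0, 0)  # both enhancements dropped
```

A network element serving a low-end STB would apply exactly this kind of discard step, while a capable device would receive the full set of layers.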
NETWORK-BASED PERSONALIZATION CONCEPT

Network-based personalization represents an evolution of the network infrastructure. The solution includes devices that allow multi-point media processing, enabling the network to target any user, with any device, with any content. In this paper we focus primarily on the cable market and TV services; however, the concept is not confined to these areas.

The existing content flow remains intact regardless of how processing functionality is extended within each of the network components, including the user device. This approach can accommodate the range of available STBs, employ modifications based on user profiles, and support a variety of sources.

The methodology behind the concept anticipates that the in and out points of the system must support a variety of containers, formats, profiles, rates and so forth. Within the system, however, the manipulation flow is unified for simplification and scalability. Network-based personalization can provide service to incoming baseline (low-resolution), Standard Definition (SD) and High Definition (HD) formats, and support multiple containers (such as Flash, Windows Media, QuickTime, MPEG Transport Stream and Real).

Network personalization requires an edge processing point and, optionally, ingest and user-premise content manipulation locations. The conceptual flow of the solution is shown in Figure 2 below: media passes through the Prepare, Integrate, Create and Present blocks, with user interaction feeding the flow, as the focus shifts from asset to session.

Figure 2: Virtual Flow: Network-based personalization

The virtual flow and building blocks defined here are generic and can be placed at different locations of the network, co-located or remote. Specific examples of architecture are reviewed later in this paper.
At the "prepare" point, media content is ingested and manipulated in two respects: 1) analysis of the content and creation of relevant information (metadata), which then accompanies it across the flow; 2) processing of the content for integration and creation, which includes manipulations such as changing format, structure, resolution and rate. The outcome of the preparation stage is a single copy of the incoming media, but in a form that includes data that will allow the other blocks to create multiple personalized streams from it.

The "integrate" point is a transition from asset focus to session focus. This block is all about connecting and synchronizing prepared media streams with instructions and other data to create a complete, session-specific media and data flow, to be provided later to the "create" block.

The "create" and "present" blocks are the final content processing steps where, for a given session, each media stream is crafted according to the user, device and medium (in the "create" block), then joined into a visual experience at the "present" block. The "create" and "present" blocks are intentionally defined separately, in order to accommodate end-user devices of different types and processing power. Further discussion of this subject appears in the "Power to the user" section below.
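The four-stage flow above can be sketched as a simple pipeline. The function names and the shape of the data are hypothetical, standing in for the far richer processing each block performs:

```python
def prepare(asset):
    """Analyze and transcode the asset once; emit media plus metadata."""
    return {"media": asset, "metadata": {"format": "mpeg4", "objects": 3}}

def integrate(prepared, session):
    """Shift from asset focus to session focus: attach session data."""
    return {**prepared, "session": session}

def create(integrated):
    """Craft the stream for the specific user, device and medium."""
    integrated["stream"] = f"stream-for-{integrated['session']['device']}"
    return integrated

def present(created):
    """Join the crafted streams into the final visual experience."""
    return created["stream"]

# One session flowing through all four blocks:
out = present(create(integrate(prepare("movie.mp4"),
                               {"user": "u1", "device": "legacy-stb"})))
```

Note that `prepare` runs once per asset while the remaining three stages run once per session, which is why the paper locates them at different points in the network.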
PUTTING IT ALL TOGETHER

The proposed implementation of network-based personalization takes into account the set of tools and the virtual building blocks defined above to create the required end result.

To support high-level, personal, session-based services, we propose to utilize the MPEG-4 toolkit, which enables scene-related information to be transmitted together with video, audio and data to a processor-based network element, in which an object-based scene is composed according to the rendering capabilities of the user device. Using MPEG-4 authoring tools and applying BIFS (Binary Format for Scenes) encoding at the content preparation stage, the system will support more efficient personalization stream processing, specifically at the "create" and "present" stages. Different encoding levels are required to support the same bitstream; for example, varied network computational power will be required to process the foreground, background and other data, such as 2D/3D elements, in the same bitstream. Moreover, some of the video rendering can be passed directly to the user reception device (STB), reducing the network's image processing requirements.

The solution described in this paper utilizes a set of tools allowing the content creator to build multimedia applications without any knowledge of the internal representation structure of an MPEG-4 scene. By using an MPEG-4 toolkit, the multimedia content is object-oriented, with spatial and temporal attributes that can be attached to it, including the BIFS encoding scheme. The MPEG-4 encoded objects address video, audio and multimedia presentations, such as 3D, as defined by the authoring tools.

The solution is built on four network elements: prepare, integrate, create and present.
All four network elements work together to ensure the highest processing efficiency and to accommodate different service scenarios: legacy MPEG-2 set-top boxes; H.264 set-top boxes with no object-based rendering capabilities; and, finally, STBs with full MPEG-4 object-based processing capabilities. Two-way feedback between the STB, the edge network and the network-based stream processor will be established in order to define what will be processed at each of the network stages.

PREPARE

At the prepare stage, the assumption is that incoming content is received in, or converted to, MPEG-4 toolkit encoding, generating content media in object-based format. Using authoring tools to upload content and create scene-related object information supports improved compression of the media that will be transmitted and processed by the network. The object-based scene will be created using MPEG-4 authoring tools and applying BIFS (Binary Format for Scenes) encoding, to support the seamless integration and control of different audio/visual and synthetic objects in a scene.

Compression and manipulation of visual content using the MPEG-4 toolkit introduces the novel concepts of a Video Object Plane (VOP) and a sprite. Using video segmentation, each frame of an input video sequence can be segmented into a number of VOPs, each of which may describe a physical object within the scene. A sprite coding technique may be used to support a mosaic layout. It is based on a large image composed of pixels belonging to a video object visible throughout a video segment; it captures spatio-temporal information in a very compact way.

Other tools might also be applied at the prepare stage to improve network processing and reduce bandwidth. These include I-VOPs ("Intra-coded Video Object Planes"), which allow an object to be encoded and decoded based on its shape, motion and texture, and Bidirectional Video Object Planes (B-VOPs), which may be used to predict each object from a past and a future reference VOP, or to build a shape motion vector from neighbouring motion vectors that were already encoded.

The output of the prepare stage is, per asset, a set of object-based information, coded as elementary streams, packetized elementary streams and metadata. The different object layers and data can in turn be transported as independent IP flows, over UDP or RTP, to the integrate stage.

INTEGRATE

The session with the preparation stage will be an "object-based" session, embodied mainly in its visualization of several visual object types. The scalable core profile is required mostly because it supports arbitrary-shape coding, temporal/spatial scalability, and so forth. At the same time, the scalable core profile will need to support computer graphics, such as 2D meshes and synthetic objects, as part of the range of scalable objects in the integration stage.

MPEG-4 object-based coding allows separate encoding of foreground figures and background scenes. Arbitrary-shape coding needs to be supported to maintain the quality of the input elements; it includes shape information in the compressed stream.

In order to apply stream adaptation to support different delivery environments and available bandwidths, temporal and spatial scalability are included in the system. Spatial scalability allows the addition of one or more enhancement VOLs (video object layers) to the base VOL to achieve different video scenes.

To summarize, at the integrate stage a user session is composed out of multiple incoming object-based assets, to create the final, synchronized video object layers and object planes. The output of the integrate stage includes all the information and media required for the session; at this point, however, the media is still not tuned to the specifics of the network, device and user: it is a superset of them.
The streams are then transported to the "create" and "present" stages, where the final manipulation is done.

CREATE

The systems part of MPEG-4 allows the creation or viewing of a multimedia sequence with hybrid elementary streams, which can be encoded and decoded with the most suitable codec for each stream. However, manipulating those streams synchronously and composing them onto a screen in real time is computationally demanding. Therefore, a temporal cache will be used in the "create" stage to store the encoded media streams. All of the elementary streams (ES) consist of either a multiplexed stream (using the MPEG-4 defined FlexMux) or a single stream, but all of them have been packetized by the MPEG-4 SL (sync layer). The use of FlexMux and the sync layer allows grouping of the elementary streams with a low multiplexing overhead at the "prepare" and "integrate" stages, where the SL is used to synchronize bitstream delivery information from the previous stage to the "create" stage.

In order to generate the relevant session (stream), the "create" stage will use an HTTP submission to ask for a desired media presentation. The submission will contain only the index of the preformatted Binary Format for Scenes (BIFS) for a pre-created and stored presentation, or a text-based description of the user's authored presentation. BIFS coding also allows seamless integration and control of different audio/video objects in a scene. The "integrate" stage will receive the request and will send the media to the "create" stage, i.e., the BIFS stream together with the object descriptor in the form of an initial object descriptor stream.

If the client side can satisfy the decoding requirements, it will send a confirmation to the "create" stage to start the presentation; otherwise, the client will send its decoding and resolution capabilities to the "create" stage. At this point the stage will repeatedly downgrade to a lower profile until it meets the decoding capabilities, or will inform the "present" stage to compose a stream that will satisfy the client decoding device (i.e., H.264 or MPEG-2).

The "create" stage will initiate the establishment of the necessary sessions for the SD (scene description) stream (BIFS format) and the OD (object descriptor) stream referenced with the user device. It will allow the user device to retrieve the compressed media stream in real time, using the URL contained in the ES descriptor stream. BIFS is used to lay out the media elementary streams in the presentation, as it provides the spatial and temporal relationships of those objects by referencing their ES_IDs.

If the "create" stage needs to modify the received scene, such as by adding an enhancement layer to the current scene based on user device or network capabilities, it can send a BIFS update command to the "integrate" stage and obtain a reference to the new media elementary stream.

The "create" stage can handle multiple streams and synchronize between different objects and between the different elementary streams of a single object (e.g., base layer and enhancement layer). The synchronization layer is responsible for synchronizing the elementary streams. Each SL-packet consists of an Access Unit (AU) or a fragment of an AU. An AU needs to have time stamps for synchronization and constitutes the data unit that will be consumed by the decoder at the "create" stage or by the user device decoder. An AU consists of a Video Object Plane (VOP).
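The sync-layer behaviour described above can be sketched as follows. This is a toy model: real SL packets carry full headers, while here each packet is reduced to a (DTS, access-unit id, fragment) tuple:

```python
def schedule_access_units(sl_packets):
    """Reassemble SL packets into access units ordered by DTS.

    Fragments belonging to the same access unit are concatenated in
    arrival order, and the completed units are handed to the decoder
    sorted by their Decoding Time Stamp.
    """
    units = {}
    for dts, au_id, fragment in sl_packets:
        units.setdefault((dts, au_id), []).append(fragment)
    return [b"".join(frags) for (dts, _), frags in sorted(units.items())]

# AU "au1" arrives in two fragments, after a later AU "au2":
packets = [(2, "au2", b"VOP2"), (1, "au1", b"VO"), (1, "au1", b"P1")]
decoded_order = schedule_access_units(packets)
```

Here `decoded_order` delivers the reassembled AU for DTS 1 before the one for DTS 2, regardless of network arrival order, which is precisely the service the sync layer provides to the "create" stage decoder.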
Each AU will be received by the decoder at the time instant specified by its Decoding Time Stamp (DTS).

The media is processed by the "present" stage in such a way that MPEG-4 objects are transcoded to either an H.264 or an MPEG-2 transport stream, utilizing stored motion vector information and macroblock mode decisions. The applicable process is defined based on the rendering capabilities of the user device. When an advanced user device with MPEG-4 object-layer decoding capabilities is the target, the "present" processor acts as a stream adaptor, resizing the streams, while composition is performed by the client device (advanced STB).

PRESENT

The modularity of the coding tools, expressed as well-known MPEG profiles and levels, allows for easy customization of the "present" stage for a selected segment: for example, legacy MPEG-2 STB markets, where full stream composition needs to be applied in the network, versus advanced set-top boxes with full MPEG-4 scene object-based capability, where minimal stream preparation needs to be applied by the network "present" stage.

Two extreme service scenarios might be applied as follows:

Network-based "present": The "present" function applies stream adaptation and resizing, composes the network object elements, and applies transcoding functions to convert the MPEG-4 file-based format to either an MPEG-2 stream-based format or an MPEG-4/AVC stream-based format.
STB-based "present": The "present" function might pass the object elements through the network, after rate adaptation and resizing, to be composed and presented by the advanced user device.

The "present" functionality is based on client/network awareness. In general, media provisioning will be based on metadata generated by the client device and the network manager. The metadata will include the following information:

  • Video format, i.e., MPEG-2, H.264, VC-1, MPEG-4, QT, etc.
  • User device rendering capabilities
  • User device resolution format, i.e., SQCIF, QCIF, CIF, 4CIF, 16CIF
  • Network bandwidth allocation for the session

"Present" stage performance

It is essential that the "present" function be composed of object-based elements that use the defined set of tools, which represent binary coded individual audiovisual objects, text, graphics and synthetic objects. It composes a Visual Object Sequence (VS), Video Object Layer (VOL) or any other defined tool into a valid H.264 or MPEG-2 stream, at the resolution and bandwidth defined by the client device and the network metadata feedback.

The elementary streams (scene data, visual data, etc.) will be received at the "present" stage from the "create" system element, which allows scalable representations and alternate codings (bitrate, resolution, etc.), enhanced with metadata and protection information. An object described by an ObjectDescriptor will be sent from the content originator, i.e., the "prepare" stage, and provides simple metadata related to the object, such as content creation information or chapter time layout.
This descriptor also contains all information related to stream setup, including synchronization information and initialization data for decoders.

The BIFS (Binary Format for Scenes) at the "present" stage will be used to place each object, with various effects potentially applied to it, in a display that will be transcoded to an MPEG-2 or H.264 stream.

STB-based "present": Object reconstruction

The essence of MPEG-4 lies in its object-oriented structure. As such, each object forms an independent entity that may or may not be linked to other objects, spatially and temporally. This approach gives the end user at the client side tremendous flexibility to interact with the multimedia presentation and manipulate the different media objects. End users can change the spatial-temporal relationships among media objects and turn media objects on or off. However, this requires a difficult and complicated session management and control architecture.

A remote client retrieves information regarding the media objects of interest, and composes a presentation based on what is available and desired. The following communication messages occur between the client device and the "present" stage:

  • The client requests a service by submitting the description of the presentation to the data controller (DC) on the "present" stage side.
  • The DC on the "present" stage side controls the encoder/producer module to generate the corresponding scene descriptor, object descriptors, command descriptors and other media streams, based upon the presentation description information submitted by the end user at the client side.
  • Session control on the "create" stage side controls session initiation, control and termination.
  • Actual stream delivery commences after the client indicates that it is ready to receive, and streams flow from the "create" stage to the "present" client. After the decoding and composition procedures, the MPEG-4 presentation authored by the end user is rendered on his or her display.

The set-top box client is required to support the architectural design of the MPEG-4 System Decoder Model (SDM), which is defined to achieve media synchronization, buffer management and timing when reconstructing the compressed media data.

The session controller at the client side communicates with the session controller at the server ("create" stage) side to exchange session status information and session control data. The session controller translates user actions into appropriate session control commands.

Network-based MPEG-4 to H.264/AVC baseline profile transcoding

Transcoding from MPEG-4 to H.264/AVC can be done in the spatial domain or in the compressed domain. The most straightforward method is to fully decode each video frame and then completely re-encode it with H.264. This approach is known as spatial-domain video transcoding. It involves full decoding and re-encoding and is therefore very computationally intensive.

Motion vector refinement and an efficient transcoding algorithm are used for transcoding the MPEG-4 object-based scene to an H.264 stream. The algorithm exploits the side information from the decoding stage to predict the coding modes and motion vectors of the H.264 encoding. Both INTRA macroblock (MB) transcoding and INTER macroblock transcoding will be exploited by the transcoding algorithm at the "present" stage.

During the decoding stage, the incoming bitstream is parsed in order to reconstruct the spatial video signal.
During the decoding process, the prediction directions for INTRA-coded macroblocks and the motion vectors are stored and then used in the coding process.

To achieve the highest transcoding efficiency at the "present" stage, side information will be stored. During the decoding of MPEG-4, a great deal of side information (such as motion vectors) is obtained. The "present" stage reuses this side information, which reduces the transcoding complexity compared to a full decode/re-encode scenario. In the process of decoding the MPEG-4 bitstream, the side information is stored and used to facilitate the re-encoding process. In the transcoding process, both MV estimation and coding mode decisions are addressed by reusing the side information, reducing complexity and computation power.

Network-based MPEG-4 to MPEG-2 transcoding

To support legacy STBs that have limited local processing capabilities and support only MPEG-2 transport streams, a full decode/encode will be performed by the "present" stage. However, the "present" stage utilizes the tools that were used for the conversion of MPEG-4 to H.264 in order to remove complexity. Stored motion vector information and macroblock mode decision algorithms for inter-frame prediction, based on machine learning techniques, will be used as part of the MPEG-4 to MPEG-2 transcoding process. Since coding mode decisions take up most of the resources in video transcoding, fast macroblock (MB) mode estimation leads to reduced complexity.
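The side-information reuse described above can be sketched as follows. This is an illustrative simplification with hypothetical names; a real transcoder would refine the reused vector within a small search window against the actual reference frames:

```python
def transcode_macroblock(mb_id, side_info):
    """Re-encode one macroblock, reusing decode-stage side information.

    `side_info` maps a macroblock id to the motion vector and coding
    mode recovered while decoding the MPEG-4 stream. Reusing them skips
    the full-frame motion search of a naive decode/re-encode.
    """
    info = side_info.get(mb_id)
    if info is None:
        # No side information available: fall back to a costly full search.
        return {"mode": "INTRA", "mv": (0, 0), "full_search": True}
    # Start from the stored vector and mode; only a small refinement
    # around them would be needed in a real encoder.
    return {"mode": info["mode"], "mv": info["mv"], "full_search": False}

side_info = {"mb0": {"mode": "INTER", "mv": (3, -1)}}
fast = transcode_macroblock("mb0", side_info)  # reuses the stored MV
slow = transcode_macroblock("mb7", side_info)  # falls back to full search
```

The gain comes from the `full_search=False` path dominating: for most macroblocks the stored vector is close enough that refinement is cheap, which is exactly why the paper reuses the same mechanism for the MPEG-2 output path.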
The implementation presented above can be incorporated in both offline and real-time environments. See Appendix 2 for elaboration on real-time implementation.
BENEFITS OF NETWORK-BASED PERSONALIZATION

Deploying network-based processing, whether complete or hybrid, has significant benefits:

  • A unified user experience is delivered across the various STBs in the field.

  • It presents a future-proof cost model for low-end to high-end STBs.

  • It utilizes the existing VOD environment, servers and infrastructure. Network-based processing accommodates low-end and future high-end systems, all under the operators' existing, managed on-demand systems. Legacy servers require more back-office preparation, with minimal server processing power overhead, while newer servers can provide additional per-user processing and thus more personalization features.

  • Rate utilization is optimized. Instead of consuming the sum of all the streams that make up the user experience, network-optimized processing reduces overhead significantly. In the extreme case, it may be a single stream with no overhead, instead of 4-5 times the available bandwidth. In the common case, it has an overhead of approximately 20%.

  • Best quality of service for connected-home optimization. By performing most or all of the processing before the content reaches the home, the operator optimizes the bandwidth and experience across the user's end devices, delivering the best quality of service.

  • Prevention of subscriber churn in favour of direct over-the-top (OTT) services. The operator has control over the edge network; over-the-top providers do not. Media manipulation in the network can and will be done by OTT operators. However, unlike cable operators, they do not control the edge network, limiting the effectiveness of their actions, unless there is a QoS agreement with the operator, in which case control stays in the operator's hands.

  • Maintaining the position of a current and future "smart pipe". Awareness of the end-user device, and processing for it, is critical for the operator to maintain processing capabilities that will allow migration to other areas, such as mobile and 3D streaming.
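The rate-utilization benefit above can be made concrete with a rough calculation. The per-stream rate and the number of composed elements are illustrative assumptions; only the single-stream-plus-20%-overhead shape comes from the text:

```python
# Illustrative bandwidth comparison for a session composed of
# four media elements (main video plus three overlays).
per_stream_mbps = 3.75          # assumed per-element rate
elements = 4

# STB-side composition: every element travels as its own stream.
stb_side = per_stream_mbps * elements        # 15.0 Mbps to the home

# Network-side composition, common case: one composed stream
# plus roughly 20% overhead, per the estimate in the text.
network_side = per_stream_mbps * 1.20        # 4.5 Mbps to the home

savings = 1 - network_side / stb_side        # fraction of bandwidth saved
```

Under these assumed numbers, network-side composition cuts the delivered bandwidth by 70%, consistent with the 4-5x multiplier the text describes for the STB-side case.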
IMPLEMENTING NETWORK-BASED PERSONALIZATION

As indicated earlier in the document, the solution can be implemented in a variety of ways. In this section, we present three of the options, all under a generic North American on-demand architecture. The three options are: hybrid network edge and back office; network edge; and hybrid home network.

Hybrid network edge and back office

As the user device powers up, or the user starts using personalization features, the user client connects to the session manager, identifies the user, his device type and his personalization requirements, and, once resources are identified, starts a session. In this implementation the "prepare" function is physically separated from the other building blocks, and the user STB is not capable of the relevant video processing/rendering. Each piece of incoming media is processed and extracted, as part of the standard ingest process, to ready it for downstream personalization. Once a session is initiated and the edge processing resources are found, sets of media and metadata flows are propagated across the internal CDN to the "integrate" step at the network edge. The set of flows includes the different media flows; related metadata (covering target STB-based processing, source media characteristics, target content insertion information, interactivity support and so forth; the metadata needs to be available for the edge to start processing the session); objects; data from the content provider/advertiser; and so forth.

After arrival at the edge, the "integrate" function aligns the flow and passes it to the "create" and "present" functions, which in this case generate a single, personally composed stream, accompanied by relevant metadata, directed at a specific user.
Figure 3: Hybrid back office and network edge
As can be seen from Figure 3 above, the SMP (Scalable Media Personalization) session manager connects the user device and the network, influencing the "integrate", "create" and "compose" edge functions in real time.

Network edge only

This application case performs all the processing on demand, in real time. It is similar to the hybrid case; however, instead of the "prepare" function being located at the back office and working offline, all functions in this case run on the same platform. As can be expected, this option has significant horsepower requirements for the "prepare" function, since content needs to be "prepared" in real time. In this example, the existing flow is almost seamless, as the resource manager simply identifies the real-time "prepare" capability as another network resource and manages it accordingly.

Figure 4: Network Edge
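The "just another network resource" idea in the network-edge-only case can be sketched as a single resource pool in which real-time "prepare" capacity is registered alongside the other edge resources. The names and capacity units below are illustrative assumptions, not an actual resource-manager interface.

```python
# Sketch: the real-time "prepare" capability registers in the same pool
# as any other edge resource, so session setup needs no special casing.

class ResourcePool:
    def __init__(self):
        self.resources = {}  # resource kind -> available units

    def register(self, kind: str, units: int):
        self.resources[kind] = self.resources.get(kind, 0) + units

    def allocate(self, kinds: list[str]) -> bool:
        """Reserve one unit of each required resource kind, all or nothing."""
        if any(self.resources.get(k, 0) < 1 for k in kinds):
            return False
        for k in kinds:
            self.resources[k] -= 1
        return True

pool = ResourcePool()
pool.register("edge_qam", 8)
pool.register("integrate", 4)
# Real-time "prepare" is just another registered resource:
pool.register("prepare_rt", 2)

# A network-edge-only session reserves the full chain at once.
ok = pool.allocate(["prepare_rt", "integrate", "edge_qam"])
print(ok)  # True
print(pool.resources["prepare_rt"])  # 1
```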
Hybrid Home and Network

In the hybrid home implementation, the end-user device (an STB in our case) is identified as one that is capable of hosting the "present" function. As a result, as can be seen from Figure 5, the "present" function is relocated to the user home, while the system demarcation is the "create" function. During the session, multiple "prepared" flows of data and media arrive at the STB, taking significantly less bandwidth than the non-prepared options and consuming less CPU horsepower as part of the "present" function.

Figure 5: Hybrid Home and Network
POWER SHIFTING TO THE USER

Although legacy STBs are indeed present in many homes, the overall processing horsepower in the home is growing and will continue to grow. That means the user device will be able to do more processing at home and will theoretically need less network-based assistance. At first glance this is indeed the case. However, when the subject is examined further, two main challenges reveal themselves.

1. The increase in user device capabilities, and with it actual user expectations, comes back to the network as a direct increase in bandwidth utilization, which in turn affects users' experience and their ability to run enhanced applications such as multi-view. For example, today's next-generation STBs support 800 MIPS to 16,000 MIPS versus the legacy 20 to 1,000 MIPS, with dedicated dual 400-MHz video graphics processors and dual 250-MHz audio processors (S-A/Cisco's next-gen Zeus silicon platform). As shown in Figure 6 below, the expected migration of media services into other home devices, such as media centres and game consoles, significantly increases the available home processing power.

Figure 6: Home Processing Power Roadmap (projected total home processing power in TMIPS, 2007 to 2010)

2. No matter how "fast and furious" processing power in the home becomes, users will always want more. Having home devices perform ALL the video processing increases CPU and memory utilization and directly diminishes the performance of other applications.

In addition, as discussed earlier in the document, the increase in open-standard home capabilities substantially strengthens the threat of customer churn for cable operators.

Network-based personalization is targeted at providing solutions to the above challenges. The approach is to use network processing to help the user, improving his experience.
By performing the "prepare", "integrate" and "create" functions in the network, and leaving only the "present" function to the user home, several key benefits are delivered that effectively address the above challenges.

Network bandwidth utilization: The "create" function drives down network bandwidth consumption. The streams delivered to the user are no longer the complete, original media as before, but rather only what is needed. For example, when viewing 1 HD and 2 SD streams in the same multi-view window, each of the three streams will have exactly the resolution and frame rate required at each given moment, resulting in significant bandwidth savings, as can be seen in Figure 7.

Figure 7: 2 SD, 1 HD bandwidth to the home (STB-only vs. hybrid vs. network-only delivery, in Mbps, for MPEG-2 and H.264)

CPU processing power: As indicated in the "putting it all together" section, our solution allows for object-layer selective composition. Also, the actual multi-view is created out of multiple resolutions, so there is no need for render-resize-compose functions at the user device, which in turn reduces overall CPU utilization.

Finally, the fact that the network can deliver the above benefits inherently drives power back into the hands of the operator, who can deliver the best user experience.
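The Figure 7 comparison can be approximated with a back-of-the-envelope calculation. The bitrates and window fractions below are illustrative assumptions for the sketch, not the measured values behind the figure.

```python
# Rough 1 HD + 2 SD multi-view bandwidth comparison.
# Assumed MPEG-2 rates: HD ~ 12 Mbps, SD ~ 3.75 Mbps (illustrative only).

HD_FULL, SD_FULL = 12.0, 3.75  # Mbps, full-resolution streams

def stb_only(hd=1, sd=2):
    """STB composes the mosaic itself: every stream arrives full size."""
    return hd * HD_FULL + sd * SD_FULL

def network_only(window_fraction=(0.5, 0.25, 0.25)):
    """Edge 'create' scales each stream to its window before delivery;
    the rate is assumed to scale roughly with each pane's screen area."""
    rates = [HD_FULL, SD_FULL, SD_FULL]
    return sum(r * f for r, f in zip(rates, window_fraction))

full = stb_only()
composed = network_only()
print(f"STB only:     {full:.2f} Mbps")
print(f"Network only: {composed:.2f} Mbps")
print(f"Savings:      {100 * (1 - composed / full):.0f}%")
```

Under these assumed numbers the network-only case needs well under half the STB-only bandwidth, which is the shape of the savings the figure describes.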
SUMMARY

Exceeding user expectations while maintaining a viable business case is becoming more challenging than ever for the cable operator. As the weight shifts to the home and to broadband streaming, the operator is forced to find new solutions to maintain leadership in the era of personalization and interactivity. Network-based personalization provides a balanced solution. The ability to maintain an open, standards-based solution, while being able to dynamically shift the processing balance based on user, device, network and time, can provide the user and the operator a "golden" solution.

ABOUT THE AUTHOR

Amos Kohn is Vice President of Business Development at Scopus Video Networks. He has more than 20 years of multi-national executive management experience in convergence technology development, marketing, business strategy and solutions engineering at telecom and emerging multimedia organizations. Prior to joining Scopus, Amos Kohn held senior positions at ICTV, Liberate Technologies and Golden Channels.
APPENDIX 1: STB-BASED ADDRESSABLE ADVERTISING

In the home addressable advertising model, multiple user profiles in the same household are offered to advertisers within the same ad slot. For example, within the same slot, multiple targeted ads will replace the same program feed, targeted at different ages of youth, while other advertisements may target the adults in the house (male, female) based on specific profiles. During the slot, youth will see one ad while the adults see another. Addressable advertising requires more bandwidth to the home than traditional zone-based advertisements. Granularity might step one level up, where the targeted advertisement targets the household rather than the individual user within it. In this case, less bandwidth is required in a given serving area in comparison to user-based targeted advertisement. The impact of home addressability on the infrastructure of channels that are already in the digital tier and enabled for local ad insertion will be similar to unicast VOD service bandwidth requirements.

In the case of a four-demographics scenario, for each ad zone, four times the bandwidth that has been allocated for a linear ad will need to be added.

APPENDIX 2: REAL-TIME IMPLEMENTATION

Processing in real time is defined by stream provisioning (fast motion estimation), stream complexity and the size of the buffer at each stage. Support for scenes as compositions of audiovisual objects (AVOs), hybrid coding of natural video and 2D/3D graphics, and advanced system and interoperability capabilities all enable real-time processing.

MPEG-4 real-time software encoding of arbitrarily shaped video objects (VOs) is a key element in the structure of the solution. The MPEG-4 toolkit unites the advantages of block-based and pixel-recursive motion estimation methods in one common scheme, leading to a fast hybrid recursive motion estimation that supports MPEG-4 processing.
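As a rough illustration of the block-matching half of such a hybrid scheme, the sketch below does a full-search motion estimation over a small window. It is a simplified teaching example only: the pixel-recursive refinement and the arbitrary-shape handling of the MPEG-4 toolkit are omitted, and frame sizes and block sizes are illustrative.

```python
# Minimal block-matching motion estimation sketch. Frames are plain
# 2-D lists of luma samples; real encoders use far larger frames,
# fast search patterns, and shape-adaptive blocks.

def sad(cur, ref, bx, by, dx, dy, bs):
    """Sum of absolute differences between a block and a displaced block."""
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def best_vector(cur, ref, bx, by, bs=4, search=2):
    """Full search in a (2*search+1)^2 window; returns the (dx, dy)
    displacement into the reference frame that minimizes SAD."""
    h, w = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, bs)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Keep the displaced block inside the reference frame.
            if not (0 <= by + dy and by + dy + bs <= h and
                    0 <= bx + dx and bx + dx + bs <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, bs)
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best

# Reference frame: a bright 4x4 square at (2, 2) on a dark background.
ref = [[0] * 12 for _ in range(12)]
for y in range(2, 6):
    for x in range(2, 6):
        ref[y][x] = 200

# Current frame: the same square shifted right and down by one pixel.
cur = [[0] * 12 for _ in range(12)]
for y in range(3, 7):
    for x in range(3, 7):
        cur[y][x] = 200

# The block at (3, 3) in the current frame matches the reference
# one pixel up and to the left:
print(best_vector(cur, ref, bx=3, by=3))  # (-1, -1)
```

A pixel-recursive method would then refine such block-level vectors per pixel; the hybrid scheme mentioned above is about combining the speed of the block search with that finer-grained accuracy.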