

  1. 1. Hierarchical resource allocation for robust in-home video streaming Peter van der Stok1,2, Dmitri Jarnikov1,2, Sergei Kozlov1, Michael van Hartskamp2, Johan Lukkien1 1, Eindhoven, Technical University, Netherlands; 2, Philips Research, Eindhoven, Netherlands Abstract High quality video streaming puts high demands on network and processor resources. The bandwidth of the communication medium and the timely arrival of the frames necessitate a tight resource allocation. Given the dynamic environment where videos are started and stopped and electro-magnetic perturbations affect the bandwidth of the wireless medium, a framework is needed that reacts timely to the changes in network load and network operational conditions. This paper describes a hierarchical framework, which can handle the dynamic network resource allocation in a timely manner. 1. Introduction Today, TVs, video recorders and set-top boxes are mostly interconnected with SCART cables. The advent of broadband and the introduction of the second PC in the home make digital home networks a more realistic way to interconnect Consumer Electronic devices. This tendency moves video away from dedicated media to video streaming across an open and shared network connecting multiple types of devices (e.g. phones, PCs, and CE-devices). This new multimedia environment introduces the problem of sharing limited network resources between video- and other applications. Both the need for resources, expressed as bit rate, and the availability of resources, expressed as bandwidth, fluctuate within intervals of tens of milliseconds. In addition there is a need for a timely delivery of the video at the destination. The source of the video can be either live (broadband TV or a camera) or taken from a storage medium. For a live transmission, a low overall delay from generation to displaying is mandatory, which imposes strict timeliness requirements. 
The source can be located inside the home, outside, or connected through some gateway (e.g. a broadband connection). The quality of the source video can vary from relatively poor - in the order of tens of kbits/s - for use on a small display, to high quality - in the order of 6 to 40 Mbits/s - for use on a large screen flat TV. The sharing of the network medium among several applications leads to a lower bandwidth available to a given video. In addition, a significant part of the home network will be based on wireless technology. The stok-elsevier-jss-4 1
  2. 2. consequences of wireless connections are reduced security and bandwidth, as well as increased fluctuation of the bandwidth through interference with other transmission sources and the moving of objects. Not meeting the resource and timeliness requirements leads to non-optimal viewing experiences in the form of distortion, hiccups, delayed viewing or stalling. To avoid these severe quality changes (leading to people refusing to buy networks and TVs) we propose a scheme that allocates the resources in such a manner that under overload a tolerable quality degradation occurs such that recognizable video is provided at all times. The scheme combines the video-source, the video coding, and the transport protocol and is especially advantageous for live broadcasts. It distinguishes fast fluctuations at the frame level (≈ 40 ms) from structural fluctuations. Fast fluctuations are caused by variations in the frame sizes and distortions in the bandwidth. Structural fluctuations last longer and come for example from the starting/stopping of another application. It is not sufficient to come up with technologically viable solutions. Because many manufacturers provide network equipment and CE-devices, inter-operability must be assured. The situation in the home is very different from telephone- or service provider networks. The networks for the latter are under control of one operator who decides the resource management procedures and technology. In the home there is no such authority, and standardization must assure that the provided technology collaborates to support the policies wished by the users of the home network. 2. Related work Most CE-devices (TV, DVD) are resource-constrained to put them in an acceptable price range. For telephones, the same resource constraints are mostly driven by battery power constraints. 
The work on video streaming most related to our work can be divided into four areas: (1) scheduling of packets on the link, (2) adaptation of processing power to video requirements in the renderer, (3) multicasting video with a bit rate adapted to the individual capacities of the receivers, and (4) transporting video over a network. Scheduling packets. The packets of all applications that share the network need to be scheduled. It is assumed that a network authority (as foreseen by UPnP QoS standardization) allocates bandwidth to the individual applications. To prevent buffer overflows and provide a balanced loading of the network, packets are scheduled such that the allocated bandwidth is not exceeded and the network load is evenly distributed in time. The unscheduled packets are viewed as interference to the scheduled applications. Leaky bucket and
  3. 3. token bucket techniques are two well-known examples . General Processor Sharing (GPS) is an ideal scheduling mechanism, which is applied to networks. Two examples show how scheduling techniques can allocate bandwidth to streams or separate asynchronous traffic from synchronous (video) traffic . Processing power in CE-devices. In , and the observation is made that the CPU needs for a video stream fluctuate from frame to frame and from scene to scene. A distinction is made between the fast fluctuation at the frame level, and the slower fluctuation at the video scene level. The concept of Scalable Video Algorithm (SVA) is introduced to adapt the quality of the decoding process to control the CPU requirements. In this framework it is possible to provide the highest quality while still meeting the deadline of each frame. In the authors describe the allocation of priorities to video frames such that important video frames, on which other frames depend, have a higher probability of acquiring CPU resources to maintain the highest possible video quality under processing overload. Multicasting. Devices have different processor/memory capabilities, thus not every device may be capable of processing all video data that is streamed by the sender. To ensure that all devices process video according to their capacity, a sender sends to each destination the amount of data that the device can successfully process. In the sender adapts the content. There are several strategies for content adaptation. The three foremost of those strategies are the simulcast model, transcoding and scalable video coding model. With the simulcast approach the sender produces several independently encoded copies of the same video content, which differ in e.g. bit/frame rates, and spatial resolutions. The sender delivers these copies to the destinations, in agreement with the specifications coming from the destination. Video transport. 
Two protocols are generally deployed for the transport of video over a network with the Internet Protocol (IP) facilities defined by the Internet Engineering Task Force (IETF): the Transmission Control Protocol (TCP) and the Real-time Transport Protocol (RTP). RTP is very successful on the Internet for supporting live video, but the quality displayed on the PC is far below the quality accepted by buyers of digital TV. RTP promotes timely arrival of packets by allowing loss of packets. Efforts are ongoing to extend RTP with retransmission facilities like those provided by TCP. TCP provides lossless transport of packets but supports live streaming over the Internet poorly. The TCP-RTM protocol is an effort to provide timeliness to audio packets by skipping late packets.
  4. 4. The fluctuating bandwidth of the wireless medium has as consequence that the quality of the rendered video fluctuates. In a controller at the receiver side removes the quality fluctuations by selecting the transmitted video parts such that the quality fluctuations are reduced. Standardization for the home. Only a few years ago, IEEE 1394 and HiperLAN were considered as promising candidates for connectivity in consumer electronics home networking as they offered timely delivery of packets and are therefore suited for multimedia transport. Now wired and wireless Ethernet are recognized as the predominant connectivity standards. The advantage is that only one technology is used for all networking applications in the home (e.g. file transport, chatting, audio, and video). Yet, it offers only best-effort but no timeliness guarantees. The IEEE 802.11e standard , which provides extensions to wireless Ethernet, offers prioritized and scheduled access. It also offers several other enhancements. Before the IEEE 802.11e standard was completed in 2005, the Wi-Fi Alliance had started a certification program for wireless multimedia based on IEEE 802.11e. The program for prioritized access, called Wi-Fi Multimedia (WMM) , was completed in 2004. The Wi-Fi Alliance is currently working on a certification program for scheduled access. Several other connectivity technologies have recently been developed that include scheduled access: WiMedia HomePlug AV , etc. Even for wired Ethernet, the Ethernet AV initiative intends to improve its timeliness properties. As it is expected that home networks remain heterogeneous in their connectivity technologies, middleware solutions are developed to deal with this heterogeneity. One of the more popular middleware technologies for the home is UPnP . The UPnP forum standardizes so-called device control protocols for e.g. AV applications, Internet Gateway devices, but also for Quality of Service (QoS). 
The UPnP-QoS v1 and ongoing v2 specifications define the use of priority-based policies. Work is currently ongoing in UPnP-QoS to develop version 3 on parameterized QoS and scheduled access. UPnP-QoS makes it possible to share bandwidth in accordance with application quality criteria. As such, it becomes possible to share the network between different types of applications. For example, the real-time aspects of audio and video streaming can be guaranteed at the expense of delays for file transport. The Digital Living Network Alliance (DLNA) is an industry forum that provides interoperability guidelines to implement digital media servers, players and renderers. Guidelines are written on the use of wired and
  5. 5. wireless Ethernet and Bluetooth, TCP/IP and UPnP. TCP is the mandatory transport protocol for AV content. The DLNA also defines various profiles for AV media formats . 3. Video transport This section motivates our choice of TCP and the deployment of scalable video coding. Figure 1 shows the most important features of the network configurations we consider. receiver 1 stored video switch a) sender AP ... real-time video receiver N video data sender application b) TCP IP data traffic shaper Figure 1 Example network configuration The sender contains a video application, which sends stored or live video to a destination on the network. It invokes a transport protocol, which packs the video frames in packets and the traffic shaper sends the packets in a regular fashion to the network. The network is composed of a wired (switched Ethernet) part and a wireless part. Packets are buffered and sent on in the switch and the Access Point (AP). Losses due to buffer overflow may occur at the sender, the switch, the AP and the receiver. For today’s wired segment there is almost no packet loss. In addition, measurements showed that losses over the wireless segment do not occur when the retry counter is set to 4 or higher . Consequently we may assume that packet loss over the communication media in the home is negligible, and most of the time losses occur due to buffer overflow. 3.1. Transport protocols When unreliable transport protocols (such as RTP, ) are used for sending multimedia streams over a large network (the Internet), it is very difficult to control the data losses happening due to congestion in the routers or due to the low reliability of the medium, e.g. a wireless link. Usual practice to handle such losses is to use error recovery mechanisms at the receiver and/or redundancy coding at the sender. However, these stok-elsevier-jss-4 5
  6. 6. mechanisms are often combined with some content adaptation technique, which uses the feedback from the network or receiver to inform the sender about the losses. This makes the system difficult to implement. TCP, being a reliable transport protocol, eliminates uncontrollable losses of data. Applications built upon TCP see the network as reliable transport means with varying throughput. Nevertheless, if at some moment the application needs to send more data than TCP can deliver (the network bandwidth drops below the bit rate of a live encoded stream), loss of data can happen due to application/TCP buffer overflows. Introducing larger buffers may decrease the probability of buffer over- and under-flows, because often the network throughput drops below the video bit-rate only for a short time, after which the “recovery” takes place. The larger the buffers, the longer the periods of insufficient bandwidth can go unnoticed by the end-user. The cost for the large buffers is increased latency (the time needed for a unit of data – e.g. a video frame – to be transferred from the sender to the receiver). Latency of more than 200 ms, which corresponds to buffering of 5-7 frames, is not acceptable in real-time video applications. Keeping the buffers small, limits the amplitude and duration of bandwidth variations that can be handled. However, these losses are easy to control by applying buffer management techniques that are different from the default Tail-Drop technique. Such techniques as Partial Buffer Sharing (PBS) and Triggered Buffer Sharing (TBS) , as well as Push out Buffers (POB) are based on dropping lower priority data to accommodate higher priority data when the buffer cannot accommodate both. Implementations of buffer management techniques in the video-streaming domain address frame skipping approaches and scalable video coding methods (see below). TCP selection. 
The major drawback of TCP is its stalling behavior and its slow start-up after an end-to-end packet loss. However, our measurements (Section 5.1) indicated that end-to-end packet losses occur rarely in home networks and are always immediately restored by acknowledgements resent within 20 ms. Consequently, stalling behavior is almost completely eliminated. On the positive side, the flow control of TCP adapts the bit rate of the packet stream to the available bandwidth. For live video, packets are lost in the sender buffer of the application, but the same application has access to this buffer and can decide which parts can be removed. In contrast, RTP just goes on sending packets, leading to uncontrolled losses.
  7. 7. 3.2. Video frames A MPEG-2 video stream is built up of I, P and B frames. Each frame represents one picture. An I-frame contains enough information to be decoded independently. A P-frame needs additional information from a directly preceding I-frame or P-frame. Motion vectors describe how a part of the referenced picture must be moved for a correct visualization in the frame to decode. B-frames need additional information from two frames, a succeeding P- or I-frame and another preceding P- or I-frame. The video is structured in Groups of Pictures (GOP), containing one I-frame followed by a sequence of B- and P-frames. Examples of legal GOPs are I(I), IPP(I), IBPBPB(I) or IBBPBBPBB(I). The (I) denotes the start of the next GOP. A scalable video coding scheme describes an encoding of video frames into multiple layers, including a Base Layer (BL) of basic quality and several Enhancement Layers (EL) containing increasingly more video data to enhance the quality of the base layer and thus resulting in video of increasingly higher quality . Scalable video coding is represented by variety of methods that could be applied to many existing video coding standards . These methods are based on principles of temporal, signal-to-noise ratio (SNR), spatial and data partition scalability . In our framework we use a specific form of temporal scalability that we call I-Frame Delay (IFD) and a form of SNR scalability that is resistant against packet losses. I-Frame Delay. IFD represents a temporal scaling technique. When the network bandwidth drops below the bit rate of the video, temporal scalability decreases the bit rate of the video by dropping video frames without influencing the quality of the surviving frames. A reasonably low amount of dropped frames might not be noticeable by the end-user. 
However, dropping frames arbitrarily (as would happen with Tail-Drop buffer handling) is not a good idea, because the impact of dropping an MPEG frame on the end-user perceived quality depends on the frame type (I, P, or B). We use the frame type to guide the frame-skipping process as follows: when the sender buffer gets full, IFD pushes the B frames out of the buffer first and then, if the bandwidth has dropped significantly for a longer period, the P and I frames. The cumulative weight of B frames in an MPEG-2 stream often comes to 50% or more. This means that by dropping only the B frames we can make the resulting example video stream fit into a bandwidth that is half the bit rate of the original stream, while still preserving inter-frame dependencies. The price for this will be a
  8. 8. decreased frame-rate - in the case of an IBBPBB(I) GOP structure, dropping all B frames would lead to 1/3 of the original frame rate and 1/2 of the original bit rate. SNR scalability. In Figure 2 two possible structures for the enhancement layer are shown. The arrows indicate the dependence of the frames on each other. A cross marks the loss of a particular frame in a layer. A horizontal thick arrow indicates the frames in a layer that are lost because they depend on the frame marked with a cross. The BL has a standard GOP structure. Using a GOP structure that has P and B frames in the EL is dangerous from a reliability point of view. If the network condition is bad, there is a high risk of losing a frame during the transmission. In this case, if the lost frame is an I or P frame, the receiver will not be able to decode the rest of the GOP, which will lead to a considerable loss of frames in the current and upper enhancement layers (Figure 2a). In our coding scheme, the enhancement layer is formed from the residuals of the frames from the base layer. That means there is no dependency between different frames inside the enhancement layer. Therefore the loss of any frame from an enhancement layer will not influence subsequent frames (Figure 2b). [Figure 2: Two SNR enhancement layer structures, each drawn over a BL with frames I B B P B B I: a) MPEG-2 standard, with EL frames of type I, B and P; b) modified SNR scalability, with untyped EL frames.] Choosing TCP together with IFD or SNR scalable video makes it possible to remove a selected part of the video at the source. The percentage of video to be removed at the source is determined by the bandwidth. 3.3. Example management Without too much loss of generality we explain the framework techniques by looking at the transmission of a continuous stream over the network of Figure 1.
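The frame-rate and bit-rate figures quoted for IFD in Section 3.2 can be checked with a few lines of Python. This is only an illustrative sketch: the relative frame sizes in `SIZES` are hypothetical values chosen so that the B frames carry half of the bits, matching the 50% weight mentioned in the text, and the function names are not taken from the paper.

```python
from fractions import Fraction

def drop_b_frames(gop: str) -> str:
    """Return the GOP pattern with all B frames removed."""
    return "".join(f for f in gop if f != "B")

def frame_rate_fraction(gop: str) -> Fraction:
    """Fraction of the original frame rate left after dropping all B frames."""
    return Fraction(len(drop_b_frames(gop)), len(gop))

def bit_rate_fraction(gop: str, sizes: dict) -> Fraction:
    """Fraction of the original bit rate left, given relative frame sizes."""
    total = sum(sizes[f] for f in gop)
    kept = sum(sizes[f] for f in gop if f != "B")
    return Fraction(kept, total)

# Hypothetical relative frame sizes (assumption): B frames carry half the bits.
SIZES = {"I": 3, "P": 1, "B": 1}

print(frame_rate_fraction("IBBPBB"))        # 1/3
print(bit_rate_fraction("IBBPBB", SIZES))   # 1/2
```

With other size ratios the bit-rate saving changes, while the frame-rate reduction depends only on the GOP pattern.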
The wireless channel retransmits packets until they arrive at the receiver, simultaneously adjusting the bit rate of the channel depending on the packet loss rate.
  9. 9. [Figure 3: Bandwidth fluctuations for a given stream plus control indication. Vertical axis: bandwidth (Mbps), 0 to 8; horizontal axis: time (sec), 0 to 24.] Figure 3 shows an example of the bandwidth fluctuations as perceived by a TCP stream sent at maximum packet rate. During the first 5 seconds the TCP stream is the single user of the link. From 5 to 10 seconds an additional file transport shares the wireless link; from 10 to 15 seconds interference is added; from 15 to 20 seconds the second stream is stopped; and finally, after 20 seconds, the interference stops as well. Every 40 ms the number of arrived bits is measured and divided by 40 ms to obtain the effective bit rate of the TCP stream with a sampling interval of 40 ms. Using Figure 3 some important observations can be made. 1. To exploit the available bandwidth to its fullest, the video bit rate curve should retrace the measured curve shown in Figure 3. However, this is impossible in practice due to, among others, the variable bit rate of the video, the use of fixed-sized video layers, inertia of the transcoder, and lack of calculation power. Even when the video bit rate follows the available bandwidth, the end-user is confronted with an unpleasant perception of frequent quality changes. Therefore, the notion of quality level is introduced. In our case, for simplicity, the quality level is fully determined by the video bit rate. 2. We could change the quality level based on feedback, which triggers the source to change to another quality level when appropriate. In Figure 3 we base this triggering on the changes of the two average bandwidth values denoted with dashed and solid lines. However, answering the question of when the quality level should change does not answer the question of what value it should change to. A worst-case level (dashed line) or a more optimistic guaranteed level (solid line) is possible.
  10. 10. 3. Using the pessimistic dashed line, the video gets through with maximum probability. However, the drawback is the low effectiveness of the bandwidth usage. In Figure 3 we see that due to the fluctuations in the intervals [0,5) and [20,25), the worst-case dashed line is 1 Mbit/s below the measured bit rate, while in interval [9,14) the bandwidth fluctuation becomes so high that the worst-case scenario brings us no video at all. The solid line depicts a video quality level close to the measured value. The price for this is an occasional loss of data. Two techniques are used to keep the effects of losses low: (1) layered scalable video and (2) I-Frame Delay (IFD). 4. Management framework The management is hierarchically ordered. At the highest level, bandwidth allocation is done to permit bandwidth sharing between videos. At the next level, the bit rate of the video is adapted to the available bandwidth. At the lowest level, less important packets are dropped to assure that more important packets arrive in time at the destination. [Figure 4: Sender refinement. Labels: sender application, video data, scalable transcoder, video data with adapted bit rate (BL, EL-1 ... EL-N), prio scheduler (C, W, S), TCP, IP, data, traffic shaper, layer configurator, feedback, QoS device, QoS manager, time spec.] Figure 4 shows the structure of the sender. The original video enters the application. Inside the application a transcoder transforms the single layer video into a layered scalable video. The bandwidth allocation algorithm specifies the sum of the sizes of the layers. The size of the individual layers and the number of layers are determined by the bandwidth fluctuations. The video layers are presented to TCP, which fragments the frames into packets. A traffic shaper outputs the packets on the link.
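The second level of this hierarchy adapts the video bit rate to the available bandwidth, which Figure 3 samples every 40 ms. The sketch below shows that sampling step and one possible way to derive a pessimistic and a more optimistic quality level from a window of samples. The use of the minimum and the mean is an assumption made for illustration only; the text does not state how the dashed and solid lines are computed, and the function names are hypothetical.

```python
def effective_bit_rates(arrived_bits, interval_ms=40):
    """Bit rate in bit/s for each sampling interval, from bits counted per interval."""
    return [bits * 1000 / interval_ms for bits in arrived_bits]

def candidate_levels(rates):
    """Return (worst_case, optimistic) levels for one observation window.

    Assumption: worst case = minimum sample, optimistic = mean sample.
    """
    return min(rates), sum(rates) / len(rates)

# Two hypothetical 40 ms samples: 200 000 and 160 000 bits arrived.
rates = effective_bit_rates([200_000, 160_000])
print(rates)                    # [5000000.0, 4000000.0]
print(candidate_levels(rates))  # (4000000.0, 4500000.0)
```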
  11. 11. 4.1. Slow fluctuations Two types of slow fluctuations are considered; (1) user interaction to increase or decrease the quality of a video stream or to start/stop a video stream, and (2) adaptation of the sizes of the video layers in response to changes in bandwidth availability for example coming from physical stimuli. User interaction The UPnP QoS working groups prescribe the elements, which distribute the bandwidth allocation decisions over the network. Each device holds a QoSDevice module, which receives from the UPnP QoS manager instructions on the bandwidth it may use (see Figure 4). The traffic shaper takes time windows in which it sends packets according to the prescription received from the QoS manager. Layer configurations. Two modules in our framework are involved in determining the number of layers and the size of each layer: (1) scalable transcoder, and (2) layer configurator (see Figure 4). Scalable transcoder. The transcoder converts non-scalable video into multi-layered scalable video. The layer configuration may be changed at run-time. The input to the transcoder is provided via a reliable channel, thus assuming that there are no losses or delays in the incoming stream. Layer configurator. The layer configurator chooses number and bit-rates of layers based on the acquired information about network conditions and receiver’s decoding capability. The network information is used to estimate the currently available network bandwidth, fluctuations and errors. The receiver’s decoding capability is used to define the maximal number of layers that can be handled by the receiver. 4.2. Fast fluctuations The scalable video solution makes it possible to react to fast changes by dropping enhancement layers for a given frame. The layers should be chosen such that the bit-rate of BL is less than the available bandwidth to assure that BL always arrives. Taking this to its logical consequence means a very small BL. 
A small BL yields a very low quality that is difficult to repair with larger EL . Therefore, the BL is chosen as high as possible. To counteract the fast bandwidth fluctuations two components are employed (1) I-Frame Delay (IFD) algorithm and (2) layered frame scheduler. IFD. Our experiments show that impressive improvements (compared to the default Tail-Drop technique) can be achieved with only two buffers, which accommodate each 1 video frame. Let us mark the frames in the buffer as follows: S – the frame that is being transmitted (and is partially sent), W – the frame waiting in the stok-elsevier-jss-4 11
  12. 12. buffer. The frame that is offered for transmission by the application is marked as C. When W is present, which means that we cannot buffer any more frames, and C arrives from the application, the scheduling algorithm decides which of the two frames (C and W) is least important for the end quality and discards it. The following algorithm favors I frames and P frames over B frames:

          WHILE (TRUE) DO
            WHILE (C is empty) DO Nothing
            IF (W is empty) THEN Store C in W
            ELSE IF (C is of type I) THEN Overwrite W with C
            ELSE IF (C is of type B) THEN Discard C
            ELSE IF (W is of type I or P) THEN Discard C
            ELSE Overwrite W with C

      A Boolean is added to the algorithm to drop all frames up to the next I-frame when a P-frame is dropped. [Figure 5: Transmission of packets from sender buffer (left), and filling sender buffer (right). Flowchart labels: Start with BL; Is it BL?; Check buffer; Check BL buffer; Is buffer empty?; Is packet outdated?; Take packet; Send packet; Choose next layer; Drop packet; Is buffer full?; Delete packets in buffer for a frame with lowest priority; Put packet into buffer.] Frame scheduler. The scheduling combines the layered scalable video with the IFD temporal approach at the sender. IFD and layered scalable video can be used independently and in isolation. The combination supports larger fluctuations in bandwidth. However, there is a minimum bandwidth of 1 Mbit/s, associated with the 802.11 technology. Layers of a scalable video are sent according to a priority scheme. Since BL information is absolutely necessary, the BL has the highest priority. The priority of each EL decreases with increasing layer number. When a frame from ELx is being transmitted and a frame from BL arrives, the sender sends the BL frame after the transmission of the current packet belonging to ELx (if any). When a frame from ELx arrives, it preempts a frame from ELy (where y>x) in a similar fashion (see Figure 5).
When the channel bandwidth becomes lower than the total video bit rate, the sender buffer gets full. To prevent sending late packets, we introduce a maximum lifetime for EL packets. If the maximum is reached, the packet is deleted from the buffer. BL packets are removed by IFD independent of their lifetime.
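The IFD replacement rule of Section 4.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class and field names are hypothetical, the waiting buffer W holds exactly one frame, and the Boolean is interpreted as being set whenever an arriving P frame is discarded, so that the remaining frames of that GOP are dropped on arrival. The maximum lifetime of EL packets is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Frame:
    kind: str   # "I", "P" or "B"
    seq: int    # hypothetical sequence number, used only for tracing

class IfdBuffer:
    """One-slot waiting buffer W with the IFD replacement rule."""

    def __init__(self) -> None:
        self.waiting: Optional[Frame] = None
        self.skip_to_next_i = False   # the Boolean mentioned in the text

    def offer(self, c: Frame) -> Optional[Frame]:
        """Offer frame C; return the frame that is discarded, if any."""
        if self.skip_to_next_i:
            if c.kind != "I":
                return c              # a P frame was lost: drop until next I
            self.skip_to_next_i = False
        w = self.waiting
        if w is None:
            self.waiting = c          # store C in W
            return None
        if c.kind == "I":
            dropped, self.waiting = w, c   # overwrite W with C
        elif c.kind == "B":
            dropped = c
        elif w.kind in ("I", "P"):
            dropped = c                    # C is a P frame here
            self.skip_to_next_i = True
        else:
            dropped, self.waiting = w, c   # W is a B frame
        return dropped

    def take(self) -> Optional[Frame]:
        """Hand W over for transmission (it becomes S) and free the slot."""
        w, self.waiting = self.waiting, None
        return w
```

In a stalled channel, offering the frames I, B, P, B, I in sequence keeps the first I frame waiting, discards the B and P frames, and finally replaces the old I frame with the new one.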
  13. 13. 4.3. Choosing layers The layer configurator uses a table that is created off-line to choose the most appropriate layer configuration as a function of the network conditions. For a predefined set of network conditions we estimate (1) the loss probability per layer for each layer configuration, and (2) the average quality that can be delivered by this layer configuration (by looking at loss probabilities and calculating the SNR quality of the video). A fixed maximum number of layers is used per network condition. If the decoding capacity of the receiver is lower than the suggested BL value, part of the bit rate of the BL is reassigned to the first EL. For example, if an optimal configuration for a given network condition yields a BL of 4 Mbps, an EL1 of 2 Mbps, and an EL2 of 2 Mbps, and due to device requirements the BL should be limited to 1 Mbps, then the BL bit rate is set to 1 Mbps and the EL1 bit rate is set to 5 Mbps. Off-line, a network simulation environment creates strategies for the layer configurator as shown in Figure 6. The environment consists of five major components: frame size generator, packet generator, sender/prioritizer, wireless channel simulator, and receiver/quality calculator. The frame size generator produces normally distributed random values for frame sizes based on the stream bit rate, assuming that the mean size of a frame in a stream is equal to the bit rate divided by the frame rate. These values are passed to the packet generator, which formats an incoming data stream into a set of packets based on the video stream syntax and the network protocol specification. The packets are buffered and sent in accordance with their priority by the sender/prioritizer. In accordance with MAC level retransmissions of 802.11-like protocols, we allow a fixed number of retransmissions for a packet that is lost. The module also uses a maximum lifetime for packets, so that outdated packets are deleted from the buffer.
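The BL reassignment rule for a capacity-limited receiver, given earlier in this section, can be written as a small function. The sketch below reproduces the worked example from the text; the function name and the list representation of a layer configuration are assumptions made for illustration.

```python
def cap_base_layer(layers_mbps, bl_limit_mbps):
    """Limit the BL bit rate and move the surplus to the first EL.

    layers_mbps: bit rates in Mbps, ordered [BL, EL1, EL2, ...].
    The total bit rate of the configuration is preserved.
    """
    if len(layers_mbps) < 2:
        raise ValueError("need a BL and at least one EL")
    bl, el1, *rest = layers_mbps
    surplus = max(0, bl - bl_limit_mbps)
    return [bl - surplus, el1 + surplus, *rest]

print(cap_base_layer([4, 2, 2], 1))   # → [1, 5, 2]
```

When the receiver can handle the suggested BL, the configuration is returned unchanged.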
The Gilbert model was used for the insertion of errors into the transmission channel of the channel simulation module. Based on the description of the network condition, expressed in average available bandwidth, error rate and burstiness of errors, the module calculates the number of error-free packets and of dropped packets and frames. Corrupted packets are dropped (a packet is considered corrupted if at least one bit of the packet is wrong). A complete frame is dropped when at least one packet of the frame is dropped.
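A Gilbert-style channel can be sketched as a two-state Markov chain with a good and a bad state. The sketch below is a simplified, packet-level illustration and not the simulator used by the authors: the parameter names (`p_gb`, `p_bg`, `loss_in_bad`) are assumptions, losses occur only in the bad state, and retransmissions, bandwidth and packet lifetimes are not modeled.

```python
import random

def gilbert_losses(n_packets, p_gb, p_bg, loss_in_bad=1.0, seed=0):
    """Return a list of booleans, True where a packet is lost.

    p_gb: probability of moving from the good to the bad state per packet.
    p_bg: probability of moving from the bad to the good state per packet.
    """
    rng = random.Random(seed)
    bad = False
    lost = []
    for _ in range(n_packets):
        # State transition first, then the loss decision for this packet.
        if bad:
            if rng.random() < p_bg:
                bad = False
        elif rng.random() < p_gb:
            bad = True
        lost.append(bad and rng.random() < loss_in_bad)
    return lost

def frames_dropped(lost, packets_per_frame):
    """A frame is dropped when at least one of its packets is lost."""
    return [any(lost[i:i + packets_per_frame])
            for i in range(0, len(lost), packets_per_frame)]

losses = gilbert_losses(1000, p_gb=0.02, p_bg=0.25)
frames = frames_dropped(losses, packets_per_frame=10)
print(sum(losses) / len(losses), sum(frames) / len(frames))
```

In this simplified model a small `p_bg` gives long error bursts, and the long-run share of time spent in the bad state is `p_gb / (p_gb + p_bg)`, so the two parameters together stand in for the error rate and burstiness named in the text.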
  14. 14. [Figure 6: Network simulation environment (input in italic is implementation specific). Modules: frame size generator, packet generator, sender/prioritizer, wireless channel simulator, receiver/quality calculator. Inputs: layer configuration (number of layers, bit-rates of layers), packetization scheme, buffer sizes, packets lifetime, number of retransmissions, network conditions (average bandwidth, error rate, burstiness). Intermediate data: frame sizes, number and size of packets per stream, packets with priorities, packet and frame error rates. Output: average PSNR as a function of network condition and layer configuration.] Finally, the packets of a given frame, transmitted over the channel, are merged together into a single frame in the receiver/quality calculator module. The receiver computes how many times corresponding frames from different layers are transmitted successfully. Based on these values and knowing (from predefined data) the mapping between layer size and quality expressed in average Peak Signal to Noise Ratio (PSNR), the module calculates the average PSNR for the received video. The layer configuration with the highest average PSNR is considered to be optimal for the given network condition. 4.4. Interoperability It is important that the framework not only solves the technical requirement of showing the best possible video as a function of transmission conditions and receiver capacity, but also provides a high level of interoperability. The UPnP and DLNA standards and recommendations govern all interaction between devices on the network. The global problem of sharing network resources between applications is solved within the context of these standards. The problem of optimizing perceived video quality is solved entirely within the sender. Consequently, it is possible to apply the framework solutions within an interoperable framework, while still allowing manufacturers to improve the quality of their own senders. 5. Evaluation Evaluation is done in two parts.
Section 5.1 shows the validity of our choice for TCP. Section 5.2 shows how the framework handles the variations in bandwidth coming from fluctuating operational conditions. stok-elsevier-jss-4 14
5.1. Video streaming with TCP over wireless medium

The measurements presented in this paper are a selection from the measurements described in . The measurement setup is as follows. A PC with a wireless card is used as sender. The PC sends video over IEEE 802.11b to an Access Point (AP). The AP is connected with switched Ethernet to the receiver PC, which renders the video. The following transmission protocols are compared: (A) unblocking UDP, (B) blocking UDP, (C) RFC 2250, which describes the packetization of MPEG video, over blocking UDP, (D) TCP and (E) TCP with IFD.

An MPEG-2 video with a duration of 60 seconds was streamed from sender to renderer at four different bit rates: 3, 4, 5 and 6 Mbit/s. The MPEG-2 video specifies that 25 frames are sent per second, i.e. 40 ms between consecutive frames. For all transmission protocols, frames are sent at the bit rate of the video or at a rate limited by the bandwidth of the wireless channel. When the bandwidth is smaller than the bit rate of the video, the effective bit rate is reduced at the sender, so that the duration of the video increases beyond the original 60 seconds for transmission protocols B, C and D. For transmission protocol A (unblocking UDP), packets were lost inside the driver of the sender in an uncontrolled fashion, which limited the duration to 60 seconds. Transmission protocol A is rejected for that reason. For transmission protocol E (TCP with IFD) the rendered video duration remained equal to 60 seconds while losses were controlled. This is explained in more detail below.

The maximum effective transmission rate is 6 Mbit/s for B, 4.5 Mbit/s for C and 5 Mbit/s for D. The effective transmission rate of protocol C (RTP, the official standardized video protocol) is lower than that of B and even of D (TCP) because the packet overhead is larger and not all packets are completely filled.

Figure 7 (a) Throughput for TCP versus wireless retry value and (b) latency of TCP versus video bit rate
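The arithmetic behind this setup can be sketched as follows. It is an idealised estimate, assuming that the sender streams at a constant rate equal to the maximum effective transmission rate reported above. The function and constant names are illustrative, and the computed durations are not measurements from the paper.

```python
FRAMES_PER_SECOND = 25      # MPEG-2 frame rate used in the experiment
VIDEO_DURATION_S = 60.0     # nominal duration of the test video

# Maximum effective transmission rates reported for protocols B, C and D (Mbit/s).
MAX_EFFECTIVE_RATE_MBIT = {"B": 6.0, "C": 4.5, "D": 5.0}


def frame_interval_ms(fps=FRAMES_PER_SECOND):
    """Time between consecutive frames in milliseconds."""
    return 1000.0 / fps


def average_frame_size_kbit(video_bit_rate_mbit, fps=FRAMES_PER_SECOND):
    """Average number of kilobits per frame at the given video bit rate."""
    return video_bit_rate_mbit * 1000.0 / fps


def transfer_duration_s(video_bit_rate_mbit, max_rate_mbit):
    """Idealised time needed to send the whole video without dropping data.

    Assumption: the sender never exceeds max_rate_mbit and otherwise sends
    at the bit rate of the video.
    """
    effective_rate = min(video_bit_rate_mbit, max_rate_mbit)
    return VIDEO_DURATION_S * video_bit_rate_mbit / effective_rate


if __name__ == "__main__":
    print("frame interval (ms):", frame_interval_ms())
    for protocol, max_rate in MAX_EFFECTIVE_RATE_MBIT.items():
        for bit_rate in (3, 4, 5, 6):
            print(protocol, bit_rate, "Mbit/s ->",
                  round(transfer_duration_s(bit_rate, max_rate), 1), "s")
```

Under these assumptions a video whose bit rate stays below the maximum effective rate finishes in the nominal 60 seconds, while a higher bit rate stretches the transfer in proportion to the ratio of the two rates.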
One of the properties of the wireless link is that the sending of a packet is immediately acknowledged at the link layer. When the sender receives no acknowledgement, the wireless transmission rate is lowered and the packet is resent. The retry value determines the number of times a packet can be resent before it is considered definitively lost. Figure 7(a) shows the throughput of TCP for different retry values of the wireless link. The horizontal axis shows the time during video transmission. The microwave is switched on after 20 seconds and switched off after 40 seconds. When the retry value of the wireless link is set to one (no retransmissions), TCP needs to resend all lost packets and the throughput drops below 1 Mbit/s. Not visible in the figure is that with a retry value of four almost all packets arrive and TCP retransmits only 3 to 4 packets during the full 60 seconds. With blocking UDP we see the same dip between 20 and 40 seconds, with the differences that the maximum transmission rate is 4.5 Mbit/s, no end-to-end retransmission takes place and frames are lost.

Figure 7(b) shows the latency of the frames with respect to the time at which they should have arrived at the receiver, given the arrival time of the first frame. A retry value of 4 is used, meaning that almost no losses occur during wireless transmission. A latency of 80 msec is acceptable with a reception buffer of two frames. For bit rates of 3 and 4 Mbit/s the video transmission is stopped after 60 seconds. During the interval [20,40) a delay builds up and disappears once for video bit rate 3 Mbit/s (probably associated with a TCP retransmission), and two delays build up and are removed later on for video bit rate 4 Mbit/s. For bit rate 6 Mbit/s the situation is dramatic: an enormous latency builds up during the whole transmission period. For bit rate 5 Mbit/s latency builds up during the period in which the microwave is on.
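The latency measure used for Figure 7(b) can be sketched as follows, assuming that arrival times are recorded per frame at the receiver in milliseconds. The function names and the sample arrival times are hypothetical and are not data from the measurements.

```python
FRAME_INTERVAL_MS = 40.0  # 25 frames per second


def frame_latencies(arrival_times_ms, interval_ms=FRAME_INTERVAL_MS):
    """Latency of each frame relative to its expected arrival time.

    Frame i is expected at (arrival time of the first frame) + i * interval_ms.
    """
    if not arrival_times_ms:
        return []
    first_arrival = arrival_times_ms[0]
    return [arrival - (first_arrival + index * interval_ms)
            for index, arrival in enumerate(arrival_times_ms)]


def late_frames(latencies_ms, buffer_frames=2, interval_ms=FRAME_INTERVAL_MS):
    """Indices of frames whose latency exceeds what the reception buffer absorbs."""
    budget_ms = buffer_frames * interval_ms  # 80 msec for a two-frame buffer
    return [index for index, latency in enumerate(latencies_ms)
            if latency > budget_ms]


if __name__ == "__main__":
    arrivals = [0.0, 40.0, 95.0, 210.0, 240.0]  # made-up arrival times
    latencies = frame_latencies(arrivals)
    print(latencies)
    print(late_frames(latencies))
```

A frame whose latency stays within the budget of two frame intervals can still be rendered on time from the reception buffer; a frame beyond that budget corresponds to a stall in the rendered video.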
The same type of behavior can be measured with blocking UDP, with the exception that for a video bit rate of 5 Mbit/s the latency builds up as it does for the video bit rate of 6 Mbit/s, and video frames are lost from time to time. A latency larger than 80 msec means that the video stalls at several moments. In the case of live video, frames would even be lost at the sender buffer, because live video cannot be delayed, contrary to the conditions in this experiment. The total effect on the viewer would be disastrous.

The IFD protocol is activated on top of TCP to trade latency against frame losses in a controlled fashion. Figure 8 shows the latency versus time with a wireless retry value equal to 4 (TCP retransmissions < 5) for different video bit rates. For all bit rates the maximum latency remains below 110 msec. This means that live video is transmitted in time even when overload conditions occur. The 110 msec means that a given frame can be rendered two to three times consecutively when frames are dropped. This dropping leads to jerky movements at times. During the 60-second transmission of the 6 Mbit/s video, with a 20-second microwave perturbation, no I-frames, 3 P-frames and 314 B-frames are removed (in total about 20% of the frames). The consequence for the video is no artifacts and an occasional jerky movement, while no rendered frame is delayed by more than 110 msec.

Figure 8 Latency of TCP with IFD for different video bit rates

The IFD protocol can also be activated on top of UDP. However, the disadvantage is that when a second wireless segment is introduced (e.g. from sender to AP and from AP to renderer), IFD must also be applied to the following segment (e.g. in the Access Point). From an interoperability point of view this is difficult to realize in practice (manufacturers have to agree on the specifics of the IFD protocol). Given all these measurements, TCP coupled with controlled frame losses seems the best solution to obtain timely live video under restricted network resources, because the TCP throughput is highest coupled with the least chance of artifacts in the rendered video.

5.2. Framework behavior

An example behavior is based on Figure 3. Each interval experiences network conditions that are different from those of both its predecessor and its successor. The rapidly fluctuating line represents the bandwidth of the channel sampled at 40 ms intervals.
Two layers are transmitted. The lower dotted line approximates the average bit rate of the BL. The higher solid line represents the average bit rate of the BL and EL together. In practice the video bit rate fluctuates around the average value with a deviation of ± 25%. As soon as a change of network conditions is detected, the layer configurator changes the bit rates of the layers. The network conditions of the different time intervals are expressed in error rate and burstiness of errors. Once these parameters are measured, the configurator uses a simple lookup table to choose a layer configuration. The lookup table is created offline as described in section 4.3.

For each of the 5 intervals of Figure 3 the best fit is calculated with one of the network transmission conditions used by the network simulation environment. Table 1 shows the three (four for time interval 3) best layer configurations for every time interval (network condition) of our example. The uppermost configuration of each time interval is preferred as the best choice for that time interval.

Interval   BL (Mbps)   EL (Mbps)   BL loss rate %   EL loss rate %
1          4           1           0                4.4
           4.25        1           0.1              12.5
           3.75        1           0                1.1
2          3.5         1           0.2              9.8
           3.25        1           0                4.2
           3.75        1           0.7              19.5
3          1.5         1.25        6.5              26.7
           1.5         1           6.5              20.8
           1.75        1           8.8              26.9
           1.25        1.5         5                26.6
4          2.25        1           2                17.9
           2           1           0.8              11.7
           2           1.25        0.8              17.6
5          4           1           0                4.4
           4.25        1           0.1              12.8
           3.75        1           0                1

Table 1 Configurations that deliver highest average objective quality under different network conditions

Under good network conditions the second and third configurations in a given interval differ from the best only in the size of the BL (plus or minus 0.25 Mbps). A small change in the bit rate of the BL produces a more significant difference in quality than a change in an EL. Under poor network conditions the BL size is very small (interval 3), so even a small increase in the bit rate of the BL brings a large rise in objective video quality. However, the penalty for the bit rate increase is a high loss rate for the BL, which affects the subjective quality. The acceptable loss rate for the BL is at most 5%, so for interval 3 the configuration with a BL of 1.25 Mbps and an EL of 1.5 Mbps is chosen.
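The selection rule described in this section can be sketched in a few lines of Python. The candidate values are those of Table 1; keying the table by interval number, the function name and the fallback for the case where no candidate meets the loss limit are assumptions made for illustration. In the framework the lookup is driven by the measured error rate and burstiness rather than by an interval index.

```python
# Candidate configurations per interval, best objective quality first.
# Each entry: (BL Mbps, EL Mbps, BL loss rate %, EL loss rate %), taken from Table 1.
LOOKUP_TABLE = {
    1: [(4.0, 1.0, 0.0, 4.4), (4.25, 1.0, 0.1, 12.5), (3.75, 1.0, 0.0, 1.1)],
    2: [(3.5, 1.0, 0.2, 9.8), (3.25, 1.0, 0.0, 4.2), (3.75, 1.0, 0.7, 19.5)],
    3: [(1.5, 1.25, 6.5, 26.7), (1.5, 1.0, 6.5, 20.8),
        (1.75, 1.0, 8.8, 26.9), (1.25, 1.5, 5.0, 26.6)],
    4: [(2.25, 1.0, 2.0, 17.9), (2.0, 1.0, 0.8, 11.7), (2.0, 1.25, 0.8, 17.6)],
    5: [(4.0, 1.0, 0.0, 4.4), (4.25, 1.0, 0.1, 12.8), (3.75, 1.0, 0.0, 1.0)],
}

MAX_BL_LOSS_PERCENT = 5.0  # acceptable loss rate for the base layer


def choose_configuration(interval, max_bl_loss=MAX_BL_LOSS_PERCENT):
    """Return (BL Mbps, EL Mbps) for the given interval.

    The first candidate (highest objective quality) whose BL loss rate is
    acceptable is chosen.
    """
    candidates = LOOKUP_TABLE[interval]
    for bl, el, bl_loss, _el_loss in candidates:
        if bl_loss <= max_bl_loss:
            return bl, el
    # Assumption: if no candidate is acceptable, take the lowest BL loss rate.
    bl, el, _bl_loss, _el_loss = min(candidates, key=lambda c: c[2])
    return bl, el


if __name__ == "__main__":
    for interval in sorted(LOOKUP_TABLE):
        print(interval, choose_configuration(interval))
```

For interval 3 the sketch skips the three configurations whose BL loss rate exceeds 5% and returns the fourth, which matches the choice made in the text; for the other intervals the uppermost configuration is returned.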
6. Conclusions

Streaming video requires a dynamic adaptation of the video to the available network resources and destination resources. Even when the bit rate of the video is in agreement with the average network bandwidth, wireless networks suffer rapid bandwidth fluctuations caused by interference from other electro-magnetic sources, and the differences in frame sizes imply fluctuating bandwidth requirements. This calls for a hierarchical approach that handles the slow bandwidth changes at the highest level and the fast bandwidth changes at a lower level. At the same time the capabilities of the receiver need to be taken into account to prevent sending data that cannot be handled at the destination. When nothing is done, the video is rendered with artifacts, which are very annoying to the viewer.

A framework is proposed that removes video data to reduce the bit rate of the video in a controlled fashion. A transcoder adapts the bit rate to the slow fluctuations, while the fast fluctuations are handled by throwing away layers of SNR scalable streams, or by removing entire frames when the bandwidth drop is extremely large and sudden. Using TCP provides the following advantage: all the intelligence of the system is concentrated at the sender side, and no network protocol adaptation is needed.

Presenting solutions for optimizing the transmission of video is not enough. To be accepted by the CE device manufacturers, the solutions should be presented in an interoperability framework. The paper shows how such a framework can be integrated within the UPnP and DLNA standardization efforts.

Acknowledgements

We would like to thank Jeffrey Kang and Jan Ouwens for many helpful discussions and valuable input.

References

[1] R. Haakma, D. Jarnikov, P. van der Stok, Perceived quality of wirelessly transported videos, in Dynamic and Robust Streaming in and between Connected Consumer-Electronic Devices (ed. P. van der Stok), Series: Philips Research Book Series, Vol. 3, 2005
[2] A. S. Tanenbaum, Computer Networks, 4th ed. Prentice-Hall, 2003.
[3] M. Zink et al, Subjective Impression of Variations in Layer Encoded Videos, KOM Multimedia Communications, 2003
[4] Pedro Cuenca et al, Performance Evaluation of Cell Discarding Mechanisms for the Distribution of VBR MPEG-2 Video Over ATM Networks. IEEE Transactions on Broadcasting, 44(2), June 1998
[5] Tao Tian et al, Priority dropping in network transmission of scalable video. International Conference on Image Processing, 3:400-3, Sept. 2000
[6] Dmitri Jarnikov, Peter van der Stok, Johan Lukkien, Wireless streaming based on a scalability scheme using legacy MPEG-2 decoders, Ninth IASTED Int. Conference on Internet & Multimedia Systems & Applications, 2005
[7] C.C. Wust, L. Steffens, R.J. Bril, and W.F.J. Verhaegh, “QoS Control Strategies for High Quality Video Processing”. In Proc. 16th Euromicro Conference on Real-Time Systems (ECRTS), Catania, Italy, 2004.
[8] D. Isovic and G. Fohler, “Quality aware MPEG-2 Stream Adaptation in Resource Constrained Systems”. In Proc. 16th Euromicro Conference on Real-Time Systems (ECRTS), Catania, Italy, 2004.
[9] D. Hoffman, G. Fernando, V. Goyal and M. Civanlar, “RTP Payload Format for MPEG1/MPEG-2 Video”, RFC 2250, Network Working Group, Jan. 1998.
[10] ISO/IEC International Standard 13818-2, “Generic Coding of Moving Pictures and Associated Audio Information: Video”, Nov. 1994.
[11] ISO/IEC International Standard 14496-2, “Information Technology – Generic Coding of Audio-Visual Objects, Part 2: Visual”, MPEG98/N2502a, Oct. 1998.
[12] ITU-T International Telecommunication Union, “Draft ITU-T Recommendation H.263 (Video Coding for Low Bit Rate Communication)”, KPN Research, The Netherlands, Jan. 1995.
[13] J. R. Yee and J. Edward J. Weldon, “Evaluation of the performance of error-correcting codes on a Gilbert channel”, IEEE Trans. on Communications, pp. 2316-2323, Aug. 1995.
[14] D. Jarnikov, P. van der Stok, C.C. Wust, “Predictive Control of Video Quality under Fluctuating Bandwidth Conditions”. ICME '04, Volume 2, pp. 1051-1054, June 27-30, 2004
[15] S. McCanne, M. Vetterli, V. Jacobson, “Low-complexity video coding for receiver-driven layered multicast”, IEEE Journal on Selected Areas in Communications, vol. 16, no. 6, pp. 983-1001, 1997.
[16] Peter Amon, Jurgen Pandel, “Evaluation of Adaptive and Reliable Video Transmission Technologies”, available from http://www.polytech.univ-nantes.fr/pv2003/papers/pv/html/main/all_pap.htm
[17] R.J. Bril, C. Hentschel, E.F.M. Steffens, M. Gabrani, G.C. van Loo and J.H.A. Gelissen, “Multimedia QoS in consumer terminals”, Proc. IEEE Workshop on Signal Processing Systems (SIPS), pp. 332-343, Sep. 2001.
[18] Yao Wang, Joern Ostermann, and Ya-Qin Zhang, “Video Processing and Communications”, Prentice Hall, 2002.
[19] H. Schulzrinne, G.M.D. Fokus, S. Casner, R. Frederick and V. Jacobson. RTP: A Transport Protocol for Real-Time Applications. Internet Engineering Task Force, A/V Transport Working Group, Jan. 1996.
[20] J. Postel. Transmission Control Protocol. RFC 793, Information Sciences Institute, September 1981.
[21] S. Liang and D. Cheriton, TCP-RTM: Using TCP for Real-Time Multimedia Applications, InfoCom 2001.
[22] L. Lenzini, E. Mingozzi, G. Stea, A unifying service discipline for providing rate-based guaranteed and fair queuing services based on the Timed Token protocol, IEEE Transactions on Computers, Vol. 51, Nr. 9, 2002.
[23] J.C.R. Bennett and H. Zhang, Hierarchical packet fair queuing algorithms, Proc. of the ACM SIGCOMM 1996.
[24] J. Ouwens, The Performance of Wireless MPEG-2 Video Streaming, Philips Internal note TN-2005/00735.
[25] IEEE 1394 standard
[26] HiperLAN, http://en.wikipedia.org/wiki/HIPERLAN#HIPERLAN.2F2
[27] IEEE 802.11e standard
[28] Wi-Fi CERTIFIED™ for WMM™ - Support for Multimedia Applications with Quality of Service in Wi-Fi® Networks, http://www.wifi.org/membersonly/getfile.asp?f=WMM_QoS_whitepaper.pdf
[29] WiMedia, http://www.wimedia.org/en/index.asp
[30] HomePlug AV White Paper, http://www.homeplug.org/en/docs/HPAV-White-Paper_050818.pdf
[31] Residential Ethernet Overview, Michael Johas Teener, CommsDesign, http://www.teener.com/ResidentialEthernet/Residential%20Ethernet.pdf
[32] UPnP forum, www.upnp.org
[33] UPnP Quality of Service specifications, http://www.upnp.org/standardizeddcps/qualityofservice.asp
[34] DLNA Interoperability Guidelines v1.5, March 2006
[35] DLNA Media Format Guidelines v1.5 - Volume 2, March 2006