Smooth Quality Streaming of Live Internet Video
                                 Dimitrios Miras and Graham Knight
Dept. of Computer Science, University College London
Gower St., London WC1E 6BT
Email: {d.miras, g.knight}
February 2004

Abstract

A live video stream, when encoded and transmitted using a congestion-controlled IP flow, experiences variable quality of service due to variations in video content activity and bandwidth availability. As a consequence, the perceived quality of the video can suffer frequent oscillations, which are particularly disturbing to the viewer. We therefore tackle the problem of accommodating the mismatch between the available transmission rate and the encoding rate required for stable perceived quality. By utilizing a reliable metric of perceived quality, we develop a technique for source rate control of real-time live video that maintains a more uniform quality. The method comprises an artificial neural network that generates predictions of the on-going quality and a fuzzy rate-quality controller that considers properties of human perception of quality in order to provide smooth streaming quality. Experimental results indicate that, in the presence of sufficient buffering, the proposed adaptation technique can improve quality stability while maintaining TCP-friendly transmission.

1 Introduction

Encoding and transmitting live video over the Internet is subject to significant variations in quality. This is attributed to the video content's inherently varying spatio-temporal complexity: video scenes with low spatial activity and motion are easier to encode with good quality, whereas complex visual content and motion increase the distortion introduced by the encoder. Furthermore, in order to avoid congestion in network resources and be fair to compliant flows, media streams need to employ congestion control to determine their fair share of bandwidth and adapt their transmission rate to match it [3].
Unfortunately, confining the source rate to the TCP-friendly rate of the stream results in frequent fluctuations in video quality. Such variability in quality is extremely annoying to the human viewer; users prefer video with medium but stable quality to a video that oscillates between high and low quality [5].

Most work on smooth quality video in the literature is concerned with the streaming of stored media rather than live video material. In such cases, the system has future frames at its disposal and can perform encoding optimizations like multiple-pass coding or efficient packet scheduling. Furthermore, measurement of the video quality performance is usually limited to a few objective metrics like mean square error (MSE) and peak signal-to-noise ratio (PSNR) (e.g., [27]). Work in [15] proposes smoothness criteria for layered streams based on layer runs, defined as the number of consecutive frames in a layer. Although frequent oscillations in the number of transmitted layers do result in variations in quality, the assumption that layer smoothness coincides with quality smoothness cannot be substantiated. This problem is aggravated by the fact that the algorithm presented works on layered CBR streams; it is known that CBR video exhibits high quality variation. Kim and Ammar [13] extend this work to solve the problem of TCP-friendly streaming of layered FGS MPEG-4 video with minimum quality variation. Again, layer runs are used as an indication of smoothness, which preserves the aforementioned disadvantages. Furthermore, the experiments simulated the transmission of high-rate streams (4 Mbps); these rates can support high quality video anyway and are not typical of what the majority of Internet users experience today.

Figure 1: Definition of the spatio-temporal (S-T) region, which spans a horizontal width, a vertical width and a temporal width over frames Fn to Fn+5.

This work presents a rate-quality adaptation method for live video that reduces fluctuations of quality. Our method differs from other adaptation techniques, which rely on inaccurate metrics to represent quality, in its use of a realistic quality metric. Internal representations of the on-going quality are obtained by an objective video quality metric [25], which has been shown to provide ratings highly correlated to human judgements of quality. The basic implementation concepts of this metric are introduced in the next section. In section 3 we introduce the proposed system architecture, describe related terminology and present the main challenges that arise. We first recognise that applying an objective quality metric places a computational burden on the streaming server. Therefore, we develop a method for on-line estimation of the ongoing quality, based on an artificial neural network (section 4), that yields accurate predictions of the objective quality. We then present (section 5) a rate-quality controller based on the principles of fuzzy logic that utilises predictions from the neural network and manipulates the encoding rate in order to provide a smooth evolution of quality. Experimental results of the approach are presented therein.

2 Description of the objective quality metric

An important issue that arises with video adaptation is to understand its impact on video quality and how to measure it.
Pixel-error metrics (like MSE and PSNR) are widely used for this purpose; however, they suffer from a major drawback: they do not always correspond well with human judgements of quality [6], especially at low-to-modest bit-rates (up to a few hundred Kbps). The main issue with MSE and PSNR is that they cannot discriminate between impairments that humans can and cannot see, or between impairments that are less or more annoying. Until recently, the most highly regarded method of measuring the quality of digital video was subjective quality assessment [11]. This method, however, requires costly and complex setups and is therefore not suitable for on-line quality monitoring. The answer to this problem is the recent emergence of objective video quality metrics (VQM) (e.g., [8, 19, 25, 22, 24, 18]). These are computational models that measure video quality in a way that preserves high correlation to human ratings of quality, by accounting for the type and magnitude of perceived distortions in the video signal. Current research on objective quality metrics is at a considerably mature level and several models are under evaluation and approaching standardisation [20].

We implemented and used the ITS VQM [25], proposed at the Institute for Telecommunication Sciences, to measure perceived quality. This metric performed strongly in a recent evaluation by the Video Quality Experts Group (VQEG) [21]. Its algorithm is based on the extraction and statistical summarization of scalar, spatio-temporal features from the original and degraded video frames, to obtain a single measure of perceived distortion. Summarization of these features occurs within spatio-temporal (S-T) regions, usually 8 × 8 pixels × 6 frames (Figure 1). For each S-T region, features from both the original and distorted frames are compared using functions that resemble human perception and visual masking, to obtain measures that quantify the level of perceptual distortions present (tiling, blurring, motion jerkiness, etc.). The calculated measures of perceptual impairments from each S-T region are then pooled for the temporal duration of the S-T region (6 frames) and summarised by averaging the worst 5% of measured distortions (this reflects the fact that subjective quality is primarily determined by the worst quality during the observation period). A single score of perceptual distortion for the whole video sequence (typically 8-10 sec long) is obtained by averaging the measured impairments of every 6-frame evaluation period (which we call S-T period) over the duration of the clip. Score values are in the [0, 1] range, with zero corresponding to a sequence with imperceptible impairments and one to a heavily distorted video.

Figure 2: Components of the smooth quality adaptation framework and the interactions between involved modules.
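The pooling step described above (average the worst 5% of the distortions measured in an S-T period, then average the per-period scores over the clip) can be sketched as follows. This is only an illustration of the pooling arithmetic, not the ITS VQM itself; the function names and the choice to round the 5% count up to at least one value are assumptions.

```python
import math


def pool_worst_fraction(distortions, fraction=0.05):
    """Average the largest `fraction` of the distortion values.

    Assumption: the number of values kept is rounded up and is at least one.
    """
    if not distortions:
        raise ValueError("need at least one distortion value")
    ranked = sorted(distortions, reverse=True)  # worst (largest) first
    count = max(1, math.ceil(fraction * len(ranked)))
    worst = ranked[:count]
    return sum(worst) / len(worst)


def clip_score(period_scores):
    """Average the per-period distortion scores over the whole clip."""
    return sum(period_scores) / len(period_scores)


# Twenty regional distortions in one S-T period: only the single worst
# value (5% of 20) contributes to the pooled score.
period = pool_worst_fraction([0.1] * 19 + [0.9])
overall = clip_score([0.2, 0.4])
```

Because only the worst values contribute, a single badly distorted region dominates the period score, which matches the stated motivation that subjective quality follows the worst quality observed.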
If D(t) is the S-T period perceptual distortion value produced by the ITS VQM at time index t, then the value 1 − D(t) can be considered a reasonable representation of the instantaneous quality for the purposes of realtime quality monitoring. We call this value S-T period quality. S-T period quality scores are scaled to the [0, 100] range (with 0 representing unacceptable quality and 100 perfect quality).

3 Problem formulation and proposed architecture

Achieving real-time video streaming with consistent quality requires a method that manipulates the source rate so that more bits are allocated to scenes or frames with high spatial and temporal energy. In other words, the problem is how to control the encoding rate, denoted Renc, in the presence of a variable and unknown available transmission rate (the TCP-friendly rate of the stream, denoted Rtcpf), so that the resulting target quality, Qtarget, is a smoothed alternative of the quality that the encoder would have produced if the video rate was set to Rtcpf (denoted Qtcpf). To enable a quality-based rate adaptation, a method that associates an encoding bit-rate with the resulting encoding quality in realtime is required. Then, appropriate target quality values can be continuously chosen for successive S-T periods. A smoother quality would mean that at times Qtarget is higher than Qtcpf, while at other times it is lower. Consequently, a similar relationship would occur between Renc and Rtcpf. Note that the sender is always transmitting at its TCP-friendly rate; therefore, mismatches between the two rates are accommodated using a sender and a receiver buffer. Hence, the system has to maintain buffer stability at the same time.

The ITS VQM can be used to obtain continuous S-T period quality scores as described in section 2. Doing so, however, requires encoding and decoding of the S-T period frames at several candidate bit-rates, and the subsequent application of the metric. This approach is prohibitive in terms of real-time performance, a strict requirement of live video coding. In order to bypass this time-consuming process, our system utilises an artificial neural network (ANN) to automatically generate accurate predictions of the continuous S-T period quality scores when presented with details of the content features of the input frames and a target encoding rate. S-T quality scores obtained from our implementation of the ITS VQM are used to train the ANN. The details of the ANN quality predictor are presented in section 4.

Figure 2 illustrates the architecture and the components of the proposed system. A companion congestion control module is periodically sampled to elicit the nominal transmission bit-rate of the stream (Rtcpf). Although the proposed system is not bound to a specific transmission control policy, we assume TCP-friendly congestion control [4]. The video encoder receives video frames from a live video source (camera, satellite feed, etc.) with the task of producing a compressed bitstream.
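To make the role of the sender buffer concrete, the sketch below tracks how a buffer fills when the encoder produces bits at Renc while the sender drains them at Rtcpf. It is a deliberately simplified model and not the paper's buffer monitor: the function name, the kbit units, the one-second update period and the clamping at zero are all assumptions chosen for illustration.

```python
def sender_buffer_levels(r_enc_kbps, r_tcpf_kbps, period_s=1.0, start_kbit=0.0):
    """Return the buffer occupancy (kbit) after each update period.

    Each period the encoder adds r_enc * period_s kbit and the sender
    removes r_tcpf * period_s kbit (hypothetical simplified model).
    """
    levels = []
    level = start_kbit
    for r_enc, r_tcpf in zip(r_enc_kbps, r_tcpf_kbps):
        level += (r_enc - r_tcpf) * period_s
        level = max(level, 0.0)  # an empty buffer cannot go negative
        levels.append(level)
    return levels


# Encoding above the TCP-friendly rate fills the buffer; encoding below
# it lets the buffer drain again.
levels = sender_buffer_levels([500, 500, 300], [400, 400, 400])
```

In this toy run the occupancy rises while Renc exceeds Rtcpf and falls afterwards, which is the mismatch the controller has to keep bounded.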
Every S-T period t, summary content statistics of video features are extracted (cf. section 4) from a small number (six) of consecutive frames. Based on these content feature statistics, which reflect the complexity of the underlying visual content, and the current nominal transmission rate Rtcpf(t), the neural network generates a prediction of the resulting quality, Qtcpf(t). The sampling of the TCP-friendly rate and the estimation of the continuous quality scores are therefore carried out at a period equal to the duration of the S-T period (i.e., every 6 frames, or 200 ms for a 30 frames per second input video). This period is an efficient tradeoff between a suitable granularity of network adaptation (footnote 1) and the duration of the quality evaluation period of the ITS VQM. It also minimises the additional delay and buffering requirement at the sender. Finally, a fuzzy rate-quality controller receives successive values of Qtcpf and an estimate of the sender and receiver buffer sizes to determine a value for Qtarget that achieves the desired encoding quality and maintains the stability of both send and receive buffers. The function of the controller is to locate, by further invocations of the ANN, the encoding bit-rate, Renc, that approximates Qtarget. We discuss why a controller based on the principles of fuzzy logic is a good choice for our system, and present how it determines the target quality Qtarget, in section 5.

4 Neural network quality predictor

An Artificial Neural Network (ANN) is a general, practical form of machine learning that provides a robust approach to approximating real, discrete or vector target functions, and learns to interpret complex real-world data. When suitably trained, ANNs can provide accurate estimation of the output(s) based on a selection of inputs, efficiently predict non-linear relationships among multidimensional data and support a general paradigm to deal with complex mathematical functions.
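Two pieces of the description above lend themselves to a small worked sketch: the S-T period duration (6 frames at 30 frames per second) and the controller's search for the rate Renc whose predicted quality approximates Qtarget. The search strategy, the candidate rate grid and `toy_predictor` are assumptions made for illustration; in the paper the predictions come from the trained ANN, and the text does not specify how the search is carried out.

```python
def st_period_ms(frames_per_period=6, fps=30):
    """Duration of one S-T period in milliseconds."""
    return 1000.0 * frames_per_period / fps


def find_encoding_rate(predict_quality, q_target, rates_kbps):
    """Return the candidate rate whose predicted quality is closest to q_target."""
    return min(rates_kbps, key=lambda rate: abs(predict_quality(rate) - q_target))


def toy_predictor(rate_kbps):
    """Hypothetical stand-in for the ANN: quality rises and saturates with rate."""
    return 100.0 * rate_kbps / (rate_kbps + 300.0)


candidate_rates = range(100, 2001, 100)  # assumed grid of rates in Kbps
r_enc = find_encoding_rate(toy_predictor, q_target=50.0, rates_kbps=candidate_rates)
```

An exhaustive scan over a small grid keeps the sketch simple; any search that repeatedly queries the predictor would fit the same description.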
Extensive research in this area has resulted in a multitude of approaches to neural network computing; we limit our discussion to the very basic principles that govern an ANN and to the most popular type of ANN, the multi-layer perceptron with error back-propagation [7, 17], which is the one used in this work.

Footnote 1: The network adaptation timescale is dictated by the frequency of incoming acknowledgements of the TCP-friendly protocol.

The basic building block of a neural network is an elementary neuron, or perceptron (Figure 3). Each element $x_i$ of the input vector $x^T = [x_1, x_2, ..., x_n]$ is weighted with an appropriate weight $w_i$, which defines the contribution of input $x_i$ to the perceptron's output $\alpha$. The sum of weighted inputs together with a bias $b$ is projected on a differentiable transfer function $f$, to produce the output of the neuron, or activation:

$$\alpha = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

Figure 3: The structure of the basic ANN component, a neuron or perceptron, and a feedforward neural network with n inputs, one hidden layer with m neurons, and one output layer. $W_L$ is the weights matrix for layer L.

Layers of several perceptrons can be combined to form a multi-layer feedforward network (Figure 3). Feedforward networks often have one or more hidden layers of non-linear neurons. Multiple layers of perceptrons with non-linear transfer functions, like log-sigmoid or tangent-sigmoid (footnote 2), allow the network to learn linear and non-linear relationships between inputs and output(s), without a-priori assumption of a specific model form.

Footnote 2: The two transfer functions are

$$\mathrm{logsig}(x) = \frac{1}{1 + \beta e^{-x}}, \qquad \mathrm{tansig}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Training a neural network means determining suitable values for a set of adjustable parameters, namely the weights and biases at every layer and neuron, by performing an iterative procedure, called training or learning, on the set of training samples. These adjustable parameters are given random initial values, and the training process consists of two steps per iteration. For a set of training input vectors with a known response $y$, a forward pass calculates all the activations at every neuron to generate a predicted response $\tilde{y}$. Then, a backpropagation step is used to adjust all the weights of the neural network based on the magnitude of the error between the predicted and actual output:

$$E^k = \frac{1}{2}\left(y^k - \tilde{y}^k\right)^2 \qquad (1)$$

$$E = \sum_{k=1}^{K} E^k \qquad (2)$$

where $E^k$ is the network error for the training pattern $x^k$ and $K$ is the total number of training patterns. The error cost measure in expression 1 is commonly used for its simplicity and it represents the deviation of the network's output from the ideal. The task of the training is to find the weights and biases that minimise $E$. This iterative procedure with new optimised parameters is repeated until an acceptably low error is achieved. Several algorithms have been proposed to adjust the weights at every iteration of the training phase, and gradient descent is probably the most popular [2]. Essentially, this method performs iterative steps in the weight space, proportional to the negative gradient of the cost function $E$, to update the weights:

$$w_{ij} = w_{ij} + \Delta w_{ij}, \qquad \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}, \qquad \frac{\partial E}{\partial w_{ij}} = \sum_{\forall k} \frac{\partial E^k}{\partial w_{ij}}$$

where $\eta$ is the step size parameter, usually called the learning rate. Training an ANN is therefore an optimisation procedure that attempts to locate the minimum of a multidimensional error surface, which usually includes several local minima. A neural network might not always find the absolute minimum, but an acceptable local minimum close to it.

After the training phase, the ANN can be validated for its generalisation capability by comparing its output with the actual (expected) values, where the input data come from a set of samples that were not seen during the training phase, called the test set. A common problem that occurs during the training process is over-fitting. The error on the training set may be reduced to a very small value, but when presented with new, unknown test patterns, the network performs poorly (large prediction error) because it has almost memorized the training samples. The tendency for over-fitting increases with the network size, but the best size for the network is difficult to know beforehand. Early stopping is a technique that is very often used to stop the training process before the network starts to over-fit.
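The perceptron activation and the gradient-descent update described above can be sketched for the simplest possible case. This is a minimal illustration, assuming a single neuron with a linear transfer function in the training step so that the gradient of E = Σ_k ½(y^k − ỹ^k)² has a closed form; it is not the multi-layer network used in this work, and all function names are hypothetical.

```python
import math


def perceptron(weights, bias, inputs, transfer=math.tanh):
    """Activation f(sum_i w_i * x_i + b); math.tanh plays the role of tansig."""
    return transfer(sum(w * x for w, x in zip(weights, inputs)) + bias)


def total_error(weights, bias, patterns):
    """E = sum_k 1/2 * (y_k - ytilde_k)^2 for a linear neuron."""
    error = 0.0
    for inputs, target in patterns:
        predicted = sum(w * x for w, x in zip(weights, inputs)) + bias
        error += 0.5 * (target - predicted) ** 2
    return error


def gradient_step(weights, bias, patterns, eta):
    """One batch update w <- w - eta * dE/dw (and likewise for the bias)."""
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for inputs, target in patterns:
        predicted = sum(w * x for w, x in zip(weights, inputs)) + bias
        delta = predicted - target  # dE_k / d(predicted)
        for i, x in enumerate(inputs):
            grad_w[i] += delta * x
        grad_b += delta
    new_weights = [w - eta * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - eta * grad_b


# Learn y = 2x + 1 from three patterns, starting from zero parameters.
patterns = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]
weights, bias = [0.0], 0.0
for _ in range(2000):
    weights, bias = gradient_step(weights, bias, patterns, eta=0.05)
```

The gradient is summed over all patterns before each update, mirroring the sum over k in the update rule. With a learning rate that is too large the same loop diverges, which is why η is usually tuned.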
With early stopping, the available data for training are split into a training set and a monitoring set, and the error on the monitoring set is also inspected during training. While at the beginning both the training and monitoring errors decrease, when the network begins to over-fit the training data, the monitoring error will start to increase. If this increase continues for a specific number of iterations, the training process is stopped and the ANN parameters (weights and biases) that produced the minimum monitoring error are retained.

The motivation behind the use of neural networks is that the encoding quality of video depends primarily on the source rate and the level of spatial activity and motion in the video scene (footnote 3). Therefore, the ANN model operates on visual content descriptors that are extracted from the input video frames during the encoding process, on an S-T period basis, and directly yields objective quality scores that are highly correlated to the quality values the ITS VQM would have produced. The function that maps content feature vectors into objective quality ratings is learned by training the neural network. For the (off-line) training process, continuous objective quality scores are obtained by directly using the ITS VQM. The ANN method does not rely on the availability of the distorted version of video frames during its real-time operation. Quality predictions are sought based only on features that are extracted from the original input frames.

Footnote 3: Under the reasonable assumptions of a non-changing picture size (resolution) and video codec.

The main challenge of the ANN design is the extraction of appropriate features from the visual content. These features should (i) adequately represent most of the spatio-temporal activity of video content and (ii), since realtime performance is important, be calculated as part of the normal operation of the encoder, so that no significant overhead occurs. Keeping in mind the requirement of real-time processing, a set of content features, summarised in Table 1, is extracted from every original frame within the S-T period. Four of these features measure texture complexity. The first two are the pixel activity, PelAct, defined as the standard deviation of luminance pixels in each (8 × 8) block averaged over the number of blocks in the frame, and the spatial spread of pixel activity, PelActSpread, defined as the deviation of block-PelAct values over the frame. Similar features are calculated to measure the 'edges' activity within a frame. Edges convey significant visual information, reveal texture, and are more susceptible to certain encoding impairments than flat regions of the image (e.g., blurring distorts the intensity of edges). From a human visual system point of view, spatial and texture masking are sensitive to the intensity of areas with edge activity. To determine the edge activity, we calculate the magnitude of pixel gradients in each block by applying a Sobel filter (gradient operator) at each pixel value:

$$\mathrm{magn}(p_{i,j}) = \left| p_{i-1,j-1} + 2p_{i-1,j} + p_{i-1,j+1} - p_{i+1,j-1} - 2p_{i+1,j} - p_{i+1,j+1} \right| + \left| p_{i-1,j-1} + 2p_{i,j-1} + p_{i+1,j-1} - p_{i-1,j+1} - 2p_{i,j+1} - p_{i+1,j+1} \right|$$

where $p_{i,j}$ is the luminance value of the pixel at row i and column j in the frame's pixel grid. The edge activity, EdgeAct, is the standard deviation of $\mathrm{magn}(p_{i,j})$ values in every block, averaged over the number of frame blocks. The spread of edge activity, EdgeActSpread, is calculated similarly to PelActSpread.

Motion-related features are also extracted with the aim of covering the range of motion attributes. The sum of absolute pixel differences, soad, is a measure of pixel change between the current (motion-estimated) frame and its reference frame. The average magnitude of the motion vectors (MV) over the whole frame, MVMagn, and the spatial variance of the MV magnitudes, MVMagnVar, are also calculated.
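The early-stopping rule described earlier in this section can be sketched as a scan over the monitoring error recorded at each training iteration. The stopping criterion used here (no new minimum for `patience` consecutive iterations) is a common variant and an assumption; the paper only states that training stops once the monitoring error has kept increasing for a specific number of iterations.

```python
def early_stopping(monitoring_errors, patience):
    """Return (best_iteration, stop_iteration) for a sequence of monitoring errors.

    best_iteration is the index with the lowest monitoring error seen before
    stopping; the parameters saved at that iteration would be retained.
    """
    best_index = 0
    best_error = float("inf")
    for index, error in enumerate(monitoring_errors):
        if error < best_error:
            best_error = error
            best_index = index
        elif index - best_index >= patience:
            return best_index, index  # no improvement for `patience` iterations
    return best_index, len(monitoring_errors) - 1


errors = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 2.0]
best, stopped = early_stopping(errors, patience=3)
```

With a patience of 3 the scan stops before reaching the later, lower error at the end of the list, which illustrates the trade-off in choosing the patience value.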
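The texture features above can be sketched directly from their definitions. This is a minimal pure-Python illustration, not the modified codec: it assumes frame dimensions that are multiples of the block size, uses the population standard deviation for both the per-block activity and its spread, and evaluates the Sobel magnitude only at interior pixels. Those choices, and the function names, are assumptions.

```python
import statistics


def block_activity(values, block=8):
    """Return (activity, spread) for a 2-D grid of values.

    activity: per-block standard deviation averaged over all blocks
              (PelAct when `values` holds luminance).
    spread:   standard deviation of the per-block values (PelActSpread).
    """
    rows, cols = len(values), len(values[0])
    block_stds = []
    for top in range(0, rows - block + 1, block):
        for left in range(0, cols - block + 1, block):
            pixels = [values[r][c]
                      for r in range(top, top + block)
                      for c in range(left, left + block)]
            block_stds.append(statistics.pstdev(pixels))
    return statistics.mean(block_stds), statistics.pstdev(block_stds)


def sobel_magnitude(p, i, j):
    """|row-direction gradient| + |column-direction gradient| at interior pixel (i, j)."""
    g_rows = (p[i-1][j-1] + 2 * p[i-1][j] + p[i-1][j+1]
              - p[i+1][j-1] - 2 * p[i+1][j] - p[i+1][j+1])
    g_cols = (p[i-1][j-1] + 2 * p[i][j-1] + p[i+1][j-1]
              - p[i-1][j+1] - 2 * p[i][j+1] - p[i+1][j+1])
    return abs(g_rows) + abs(g_cols)


# An 8 x 16 frame: a flat left block and a checkerboard right block.
frame = [[10] * 8 + [2 * ((r + c) % 2) for c in range(8)] for r in range(8)]
pel_act, pel_act_spread = block_activity(frame)
```

Applying `block_activity` to a grid of `sobel_magnitude` values instead of luminance values would give EdgeAct and EdgeActSpread under the same assumptions.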
To locate frames where strong motion in portions of the image may lead to localised impairments, the average magnitude of MVs is also measured for each of the four spatial quadrants of the frame, resulting in four additional features, MVMagnUL, MVMagnUR, MVMagnLL, and MVMagnLR. The ratio of the motion-estimated macroblocks (MB) over the total number of MBs, MERatio, is also calculated as a representative measure of the coding efficiency of the motion estimation process. Motion complexity, MotCompl, is calculated as follows: motion vectors are classified according to the dominant axis of the vector (up, down, left, right, none), and the variance of this five-bin histogram is taken. A uniform histogram of the directional MVs reveals a more complex motion throughout the frame. MotDirChange represents changes in the motion direction, and is formed by subtracting the MVs of successive motion-estimated blocks and averaging over the number of macroblocks, M, in the frame:

$$\mathrm{MotDirChange} = \frac{1}{M} \sum_{i} \left\| mv_{F}(i) - mv_{F'}(i) \right\|$$

where $F'$ represents the reference frame of frame $F$ used for motion estimation. MotAccel captures the change in the motion speed (acceleration), again averaged over the number of MBs:

$$\mathrm{MotAccel} = \frac{1}{M} \sum_{i} \left( \left\| mv_{F}(i) \right\| - \left\| mv_{F'}(i) \right\| \right)$$

Descriptive statistics of these features are then calculated over the 6-frame period to obtain content feature descriptors. These summary statistics are the mean, median, standard deviation, minimum and maximum values, and the 5, 25, 75 and 95-percentiles. In total, one hundred and thirty five (135) content feature descriptors are gathered per S-T activity period (15 features, each summarised by 9 statistics). We modified an H.263+ video codec to perform the feature extraction process and the calculation of the descriptive statistics (this process can be applied, with minimal modifications, to any other hybrid DCT-based codec that employs motion estimation).
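The three motion descriptors defined above can be sketched as follows. Several details are assumptions rather than statements from the paper: motion vectors are (dx, dy) pairs in image coordinates (positive dy points down), ties between the horizontal and vertical component go to the horizontal axis, the histogram variance is the population variance, and vector magnitudes are Euclidean norms.

```python
import math
import statistics


def dominant_direction(mv):
    """Classify a motion vector by its dominant axis."""
    dx, dy = mv
    if dx == 0 and dy == 0:
        return "none"
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"


def mot_compl(motion_vectors):
    """Variance of the five-bin directional histogram (lower = more uniform)."""
    bins = {"up": 0, "down": 0, "left": 0, "right": 0, "none": 0}
    for mv in motion_vectors:
        bins[dominant_direction(mv)] += 1
    return statistics.pvariance(list(bins.values()))


def mot_dir_change(mvs_frame, mvs_reference):
    """Mean magnitude of the vector difference between co-located MVs."""
    total = sum(math.hypot(a[0] - b[0], a[1] - b[1])
                for a, b in zip(mvs_frame, mvs_reference))
    return total / len(mvs_frame)


def mot_accel(mvs_frame, mvs_reference):
    """Mean difference between co-located MV magnitudes."""
    total = sum(math.hypot(*a) - math.hypot(*b)
                for a, b in zip(mvs_frame, mvs_reference))
    return total / len(mvs_frame)


uniform_pan = mot_compl([(4, 0)] * 5)                             # all vectors point right
mixed = mot_compl([(0, -1), (0, 1), (-1, 0), (1, 0), (0, 0)])     # one vector per bin
```

A reversal of direction at constant speed shows why the two last features are kept separate: `mot_dir_change` is large while `mot_accel` is zero.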
4.1 Neural network architecture and prediction performance

The ANN architecture comprises a two-layer feedforward network with backpropagation, with one hidden layer of nh neurons and a non-linear (tangent-sigmoid) transfer function, and a linear output layer.

Table 1: Content features extracted from the original video frames

Content Feature          Description
PelAct                   Pixel activity averaged over all blocks
PelActSpread             Deviation of pixel activity over all blocks
EdgeAct                  Edge activity averaged over all blocks
EdgeActSpread            Deviation of edge activity over all blocks
soad                     Sum of abs. pixel differences between adjacent frames
MVMagn                   Magnitude of motion vectors
MVMagnVar                Spatial variance of motion vector magnitudes
MVMagnLL, MVMagnLR,      Magnitude of motion vectors per quadrant: lower left and right,
MVMagnUL, MVMagnUR       upper left and right
MERatio                  The ratio of motion-estimated MBs in the frame
MotCompl                 Motion complexity (variance of the directional motion vector histogram)
MotDirChange             Change of motion direction between adjacent frames
MotAccel                 Acceleration of motion between adjacent frames

A data dimensionality reduction process precedes the ANN training process. Its purpose is to reduce the size of the input vectors that train the neural network, remove redundancies present among the original 135 inputs and retain those variables that are relevant to the model, thereby improving both training time and generalisation performance. The first step applies Principal Component Analysis (PCA) to the input data matrix. Principal Component Analysis [12] is a data dimensionality reduction scheme that is very often used with neural networks. This data compression technique extracts characteristic features from the data whilst minimizing the information loss. The basic principle of PCA is the representation of the data by a reduced set of unit vectors (eigenvectors). The eigenvectors are positioned along the directions of greatest data variance, so that the distances from the data points to the axis of each vector are minimized across the full data set. PCA is applied to the training input vectors (calibration matrix).
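As a small illustration of what a principal component is, the sketch below finds the direction of greatest variance of a data set and projects the data onto it. It swaps in power iteration on the covariance matrix, which is a simple stand-in and not necessarily how PCA was computed in this work; mean-centring the data and the choice of start vector are likewise assumptions. Power iteration can fail if the start vector happens to be orthogonal to the leading eigenvector.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit_vector, scores) for the first principal component.

    rows: list of equal-length numeric tuples (one pattern per row).
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(m)]
           for a in range(m)]

    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalise; the vector converges to the leading eigenvector.
    vector = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # degenerate case: no variance along the current vector
        vector = [x / norm for x in product]

    scores = [sum(c[j] * vector[j] for j in range(m)) for c in centred]
    return vector, scores


# Points on the line y = x: all variance lies along the direction (1, 1).
pc1, scores = first_principal_component([(0, 0), (1, 1), (2, 2), (3, 3)])
```

Here the two original variables are fully redundant, so a single component carries all the variance, which is the kind of redundancy the dimensionality reduction step is meant to remove.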
Therefore, if $F_{n \times m}$ is the calibration matrix and $P_{m \times m}$ is the principal component transformation matrix, the transformed calibration set of patterns is the (n × m) matrix $F' = F \times P$. Note that the same transformation has to be applied to the set of test patterns as well, using the same transformation matrix $P$ derived from the calibration matrix. Usually, most of the data variance can be explained using the first few principal components (PCs) of $F'$. While it is difficult to assess the significance of the original input variables to the model, it becomes much easier to do so when the input data are preconditioned with PCA.

Input features that are relevant to the model can be derived through a stepwise, trial and error method. For example, with stepwise addition, one may start with an initial small set of inputs (the first few PCs), and add a new variable at a time until a satisfactory monitoring or prediction error is achieved. This carries the risk that the method may stop with selected input variables $PC_1, ..., PC_m$, while some information that is important to the model is also contained in an input $PC_n$, n > m. With stepwise elimination, a deliberately large subset of initial variables is chosen, and variables are subsequently removed until the monitoring or prediction error no longer improves. The selection process of the appropriate input variables can be improved if the relevance of each variable to the model, called its sensitivity, can be estimated. We use a two-variance-based approach for variable sensitivity determination, proposed in [1]. This method is based on the estimation of the individual contribution of each input variable to the variance of the predicted response of the neural network (derived when the neural network is trained with all input variables but the one under consideration set to zero). Once all sensitivities are estimated, the variable with the lowest sensitivity is tentatively removed and the ANN is retrained. If the monitoring error decreases, the variable is deemed irrelevant to the model and is removed; otherwise, it is restored and the process continues with the next variable. At the end of this process, a subset of the initial input features forms the new input feature set.

Figure 4: This plot depicts a series of neural network predictions together with the actual values obtained from the ITS VQM. The prediction residual is also shown. (Vertical axis: S-T period quality score; legend: actual scores, ANN predictions, absolute error.)

In order to determine which subset of principal components (PCs) is relevant to the model, a sensitivity-based, stepwise elimination process is then applied. First, the relevance, or sensitivity, of each variable (PC) to the model is estimated using the two-variance-based approach [1]. This method estimates the individual contribution of each PC to the variance of the predicted response of the neural network (derived when the neural network is trained with all input variables but the one under consideration set to zero). Once all sensitivities are estimated, the variable with the lowest sensitivity is tentatively removed and the ANN is re-trained. If the monitoring error decreases (footnote 4), the variable is deemed irrelevant to the model and is removed; otherwise, it is retained and the process continues with the next variable. The stepwise elimination process retained a total of eighteen (18) input variables.
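The sensitivity-ordered stepwise elimination described above can be sketched as a loop over the variables, from lowest to highest sensitivity. The sketch assumes that sensitivities are estimated once up front and that a caller-supplied function retrains the network on a given subset and returns its monitoring error; `monitoring_error`, `toy_error` and the variable names are hypothetical stand-ins, not the authors' code.

```python
def stepwise_elimination(sensitivities, monitoring_error):
    """Return the variables kept after sensitivity-ordered elimination.

    sensitivities:    dict mapping variable name -> estimated sensitivity.
    monitoring_error: function(list of names) -> monitoring error obtained
                      after retraining on that subset (hypothetical helper).
    """
    selected = list(sensitivities)
    best_error = monitoring_error(selected)
    for name in sorted(sensitivities, key=sensitivities.get):  # lowest first
        if len(selected) == 1:
            break  # always keep at least one input
        trial = [v for v in selected if v != name]
        trial_error = monitoring_error(trial)
        if trial_error < best_error:
            selected, best_error = trial, trial_error  # variable was irrelevant
        # otherwise the variable is retained and the next one is tried
    return selected


def toy_error(selected):
    """Toy stand-in for retraining: PC1 and PC2 matter, extra inputs add noise."""
    penalty = 0.01 * len(selected)
    if "PC1" not in selected:
        penalty += 1.0
    if "PC2" not in selected:
        penalty += 0.5
    return penalty


kept = stepwise_elimination(
    {"PC1": 0.9, "PC2": 0.5, "PC3": 0.1, "PC4": 0.05}, toy_error)
```

Each tentative removal costs one retraining run, so the total cost grows with the number of candidate variables; starting from principal components rather than all 135 raw descriptors keeps that number manageable.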
A large collection of video scenes were selected to test the proposed neural network model, featuring a wide range of content, camera actions (static, panning, zooming, fades, etc.), and various levels of scene activity. Video frames were extracted from: action movies (The Matrix, Terminator, XMen), sports (extended football clips from the English Premiership) and also several short video clips from the VQEG web site [20]. In total, the test video library contained approximately 39,000 frames (6,500 S-T periods). From the set of 6,500 patterns, 80% were randomly chosen as the training set and the rest 20% as the validation (test) set. One fourth of the training samples formed the monitoring set, and the rest was the actual training set. All sequences in the video library where encoded at several rates, ranging from 100 Kbps to 2 Mbps (with a step of 100Kbps). Multiple neural networks were then trained, each corresponding to a distinct encoding rate in the selected range. We performed the sensitivity analysis and the stepwise elimination of input variables for various configurations of hidden layer neurons. This analysis proved that the value of nh does not significantly affect the prediction performance of the ANN, nevertheless, for the specific data set, a network with nh = 18 produced the smallest monitoring error. We investigated the ANN prediction ability when presented with unknown input patterns. To clearly 4 During the training phase a monitoring set is also fed to the ANN to facilitate better training and improve the generali- sation performance by preventing the network from memorizing the training samples (over-fitting). 9
  10. 10. visualise that the neural network predictions closely follow the actual S-T quality scores, we plot in Figure 4 a 100-sample subset of the ANN outputs together with the corresponding expected scores. The bottom line on the same graph corresponds to the absolute error between the actual score and the ANN prediction. Figure 5 shows, for the test set of features (approx. 1300 input patterns, encoding rate: 400 Kbps), the actual objective S-T quality scores obtained from the ITS VQM plotted against the corresponding outputs of the neural network. The neural network achieved significant prediction accuracy: the Pearson correlation between the predicted and expected responses was as high as 0.901, and the mean of the absolute residual error was 4.20 with a standard deviation of 3.54. Similar generalisation performance was obtained for the other encoding bit-rates in the 100-2000 Kbps range used in our experiments, as shown in Figure 6 (footnote 5).
[Figure 5 plot, titled "Actual vs. predicted scores": quality prediction (vertical axis, 20 to 100) against actual quality score (horizontal axis, 20 to 100).]
Figure 5: Actual quality scores vs. ANN prediction for the test set (400 Kbps).
4.1.1 Examination of additional overhead
The on-line quality predictor introduces two additional processing modules in the live-streaming system: the extraction and statistical manipulation of content features inside the video codec, and the invocation of the neural network quality predictor. The overhead that feature extraction and statistical manipulation of the data add to the video encoder is not significant. Most of the chosen features, such as pixel activity, soad and motion vectors (but not the edge energy), are calculated as part of the encoding process, namely for motion estimation, so no additional delay occurs. Features like complexity of motion, acceleration and direction of motion are computed from the values of the motion vectors using simple statistics.
Calculation of edge activity in the frame adds a slight overhead (the Sobel gradient involves twelve additions per pixel) (footnote 6). The rest of the processing cost involves the statistical summarisation of both frame-level features (mean and standard deviation over the frame) and S-T period level content descriptors.
Footnote 5: The prediction error of the ANN is even smaller at higher bit-rates, because quality values were usually in the high end of the scale, allowing the neural network to learn better.
Footnote 6: Furthermore, we can remove the edge feature from the features vector with only a small loss in the prediction accuracy of the ANN.
In total, the additional overhead is on average less than
  11. 11. 15ms per 6-frame activity period (for CIF-size frames on a 2.2GHz processor); therefore it does not affect real-time performance. Similarly, the overhead of the neural network is also negligible: by nature, an ANN might require a significant amount of time to train, but calculating a response involves only a small number of operations on the input variables vector.
[Figure 6 plot, titled "Absolute prediction error of the ANN": average absolute error (vertical axis, 0 to 10) against encoding bit-rate points (horizontal axis, 0 to 2000 Kbps).]
Figure 6: ANN prediction error at different encoding bit-rates. Error bars extend to the 5th and 95th percentiles.
5 Estimation of encoding rate
We can achieve a more stable target quality Q_target by smoothing out transient increases and especially drops of Q_tcpf. At the same time, Q_target has to be responsive to consistent changes in Q_tcpf. As Q_target deviates from Q_tcpf, so does the encoding rate R_enc in relation to the transmission rate R_tcpf. While mismatches between the source and channel rates can be alleviated by the sender and receiver buffers in the short term, Q_target has to follow the 'trend' of Q_tcpf in the longer term. The basis of the approach is to calculate the target quality value as a moving average (MA) of Q_tcpf:
    Q_target(t) = α · Q_target(t − 1) + (1 − α) · Q_tcpf(t),  α ∈ [0, 1]
for successive S-T periods t. MA predictors are quite simple, but the main design difficulty is the choice of the weight α. Given that in practice the variation of Q_tcpf is unknown, setting α to a high value eliminates large variations but lacks responsiveness and compromises the stability of the buffers, while a small value fails to decrease variations. The desired approach is to determine α on-line, according to changes in Q_tcpf and the status of the two buffers. We introduce a fuzzy logic controller [14] to dynamically calculate appropriate values for α.
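To make the trade-off in the choice of α concrete, the moving-average update can be sketched as follows. This is only an illustration of the recurrence given above: the function name, the `q_start` initial value and the sample quality values are assumptions introduced here, and `alpha` may be a constant or a per-period callable standing in for whatever controller supplies the weight.

```python
def smooth_target_quality(q_tcpf, alpha, q_start):
    """Return the Q_target series for a series of Q_tcpf values.

    Implements Q_target(t) = alpha * Q_target(t-1) + (1 - alpha) * Q_tcpf(t).
    alpha may be a single weight or a callable t -> weight, so that a
    controller can supply a different weight for every S-T period.
    q_start plays the role of Q_target before the first period (an assumption).
    """
    q_target = []
    previous = q_start
    for t, q in enumerate(q_tcpf):
        a = alpha(t) if callable(alpha) else alpha
        if not 0.0 <= a <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        previous = a * previous + (1.0 - a) * q
        q_target.append(previous)
    return q_target


# A transient quality drop (made-up values) smoothed with two different weights.
q_tcpf = [80.0, 40.0, 80.0]
print(smooth_target_quality(q_tcpf, 0.5, q_start=80.0))
print(smooth_target_quality(q_tcpf, 0.9, q_start=80.0))
```

With α = 0 the target simply follows Q_tcpf, and with α = 1 it never moves from its starting value, which are the two extremes the text warns about.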
Fuzzy logic was introduced by Zadeh [26] to describe vagueness in system behaviour, where variables or parameters do not exhibit exclusive set membership, but a gradual transition between states (that is, a grade of membership). In our case, a fuzzy controller is useful because it is difficult to describe the system's behaviour analytically (the output rate of the encoder can hardly be characterised accurately and the transmission rate is unknown beforehand), yet we know qualitatively what the behaviour of the system should be.
  12. 12. [Figure 7 plot: three panels of membership functions. error (horizontal axis, −1.0 to 1.0): neglarge, negsmall, zero, possmall, poslarge. buflev (horizontal axis, 0.0 to 1.0): low, medium, high. α (horizontal axis, 0.0 to 1.0): small, medium, large.]
Figure 7: Membership functions of all fuzzy sets for the controller's inputs (error and buflev) and output (parameter α).
With respect to the sender and receiver buffer sizes, the difference between the encoding and transmission rates has the following effects: if R_enc is higher than R_tcpf then the sender buffer size increases; if equal, it remains unchanged; otherwise it decreases. Similarly, the receiver buffer fills, remains unchanged or empties when the transmission rate is higher than, equal to or lower than the video data playout rate. There are two inputs to the fuzzy controller, error and buflev, and one output, the EWMA parameter α. The input error is associated with the change in the value of Q_tcpf (or quality error) between successive S-T periods: ΔQ_tcpf = Q_tcpf(t) − Q_tcpf(t − 1). This change represents the level of short-term variation in Q_tcpf that we would like to curtail (footnote 7). The input buflev ∈ [0, 1] reveals how buffered data is distributed between the send and receive buffers, and is defined as buflev = B_r / (B_r + B_s), where B_s and B_r are the sender and receiver buffer sizes respectively. The value of this variable is a convenient way to establish whether the sender or receiver buffer level is low. If the receive buffer runs low (B_r → 0) then buflev → 0, while if the send buffer approaches underflow levels (B_s → 0), buflev → 1. Therefore, the system can determine at any time how video data is distributed between the two buffers, monitor whether any buffer runs at low levels and react accordingly.
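The two controller inputs defined above reduce to a couple of arithmetic steps, sketched here for illustration. The division by 20 follows the paper's footnote on input scaling; clamping the result to [-1, 1] and returning 0.5 when both buffers are empty are assumptions of this sketch, as are the function names.

```python
def quality_error(q_tcpf_now, q_tcpf_prev, scale=20.0):
    """Scaled change in Q_tcpf between successive S-T periods.

    error = delta_Q_tcpf / 20, as in the paper's footnote; values outside
    [-1, 1] are clamped (an assumption made for this sketch).
    """
    delta = q_tcpf_now - q_tcpf_prev
    return max(-1.0, min(1.0, delta / scale))


def buffer_level(b_recv, b_send):
    """buflev = B_r / (B_r + B_s), the share of buffered data held by the receiver.

    Close to 0 means the receiver buffer runs low; close to 1 means the
    sender buffer runs low.
    """
    total = b_recv + b_send
    if total <= 0:
        return 0.5  # both buffers empty: treat as evenly split (assumption)
    return b_recv / total


print(quality_error(70.0, 80.0))    # quality dropped by 10 units
print(buffer_level(20_000, 60_000)) # receiver holds a quarter of the buffered data
```

Because buflev is a ratio, the buffer sizes can be given in any unit (bytes, packets or seconds of video) as long as both use the same one.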
Note that, for low packet loss rates, the sender can quite accurately track the size of the receiver buffer at any time, by continuously updating a buffer_sz variable:
    buffer_sz(t) = buffer_sz(t − 1) + (R_tcpf(t) − R_enc(t)) · T
where T is the duration of an adaptation period. Packet loss can, however, degrade the accuracy of this estimate. Alternatively, a method that lets the receiver feed this information back to the sender is preferable.
We define five gradations for the fuzzy input error (linguistic values that error takes on): large negative (neglarge), small negative (negsmall), zero, small positive (possmall), and large positive (poslarge). The buflev variable takes on three linguistic values: low, medium and high. These gradations are enough to describe the different states of both buffers; adding further gradations does not present any obvious advantage and introduces unnecessary complexity. A fuzzy value 'low' means that the receiver buffer runs low, a 'high' value means that the sender buffer's level is low, while a 'medium' value means that there is enough data distributed, more or less evenly, between the two buffers. Finally, the gradations of the controller's output α are represented by three linguistic values: small, medium and large. We use standard trapezoidal fuzzy sets and the corresponding membership functions are shown in Figure 7.
The last step in the design of the fuzzy controller is the definition of the rules that govern its operation. The controller opts for a fuzzy large α in order to preserve a stable evolution of Q_target.
Footnote 7: Input error is scaled to the [-1, 1] range, using error = ΔQ_tcpf / 20, since the majority of ΔQ_tcpf values are confined to the [-20, 20] range, as established through quality experiments with numerous video clips.
However, when buflev is fuzzy low and there is a fuzzy negative error, α needs to take a smaller value to avoid a
  13. 13. receiver buffer underflow. Q_target then gets close to Q_tcpf, thus avoiding an R_enc value that is much higher than the transmission rate R_tcpf. Analogous rules are employed when error is fuzzy positive and buflev is high, to avert a sender buffer underflow. With this approach to quality control, preference is given to preventing the target quality from dropping to low values. This is in accordance with subjective experiments, which reveal that quality during a time interval is primarily determined by the worst impairment observed and that drops in quality have a greater negative impact than the positive effect of an equally sized quality increase [25, 9]. Using the above guidelines, the complete set of control rules of the fuzzy controller is defined as follows:
1. if error is neglarge and buflev is low then α is small
2. if error is neglarge and buflev is medium then α is large
3. if error is neglarge and buflev is high then α is large
4. if error is negsmall and buflev is low then α is medium
5. if error is negsmall and buflev is medium then α is large
6. if error is negsmall and buflev is high then α is large
7. if error is zero and buflev is low then α is large
8. if error is zero and buflev is medium then α is large
9. if error is zero and buflev is high then α is large
10. if error is possmall and buflev is low then α is large
11. if error is possmall and buflev is medium then α is large
12. if error is possmall and buflev is high then α is medium
13. if error is poslarge and buflev is low then α is large
14. if error is poslarge and buflev is medium then α is large
15. if error is poslarge and buflev is high then α is small
5.1 Performance of the quality controller
As discussed in section 4.1, the ANN is trained to provide predictions at discrete operating bit-rates R_0, R_1, ..., R_N.
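As an aside on the rule base listed before Section 5.1, the fifteen rules can be written compactly as "α is large unless one of four exceptions applies" and evaluated with a small fuzzy inference routine. The sketch below is a hedged illustration only: the trapezoid breakpoints, the representative α values (0.2, 0.6, 0.9) and the weighted-average defuzzification (a zero-order Sugeno-style simplification) are assumptions chosen for this example, not values or methods taken from the paper or read from Figure 7.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints (a, b, c, d); NOT taken from the paper's Figure 7.
ERROR_SETS = {
    "neglarge": (-1.0, -1.0, -0.4, -0.2),
    "negsmall": (-0.4, -0.2, -0.2, 0.0),
    "zero": (-0.2, 0.0, 0.0, 0.2),
    "possmall": (0.0, 0.2, 0.2, 0.4),
    "poslarge": (0.2, 0.4, 1.0, 1.0),
}
BUFLEV_SETS = {
    "low": (0.0, 0.0, 0.2, 0.4),
    "medium": (0.2, 0.4, 0.6, 0.8),
    "high": (0.6, 0.8, 1.0, 1.0),
}
# Representative crisp value for each output label (assumed for illustration).
ALPHA = {"small": 0.2, "medium": 0.6, "large": 0.9}

# The rule base from the text: every rule yields "large" except these four.
EXCEPTIONS = {
    ("neglarge", "low"): "small",    # rule 1
    ("negsmall", "low"): "medium",   # rule 4
    ("possmall", "high"): "medium",  # rule 12
    ("poslarge", "high"): "small",   # rule 15
}


def rule_output(error_label, buflev_label):
    return EXCEPTIONS.get((error_label, buflev_label), "large")


def fuzzy_alpha(error, buflev):
    """Weighted average of rule outputs; each rule fires with min(memberships)."""
    error = max(-1.0, min(1.0, error))
    buflev = max(0.0, min(1.0, buflev))
    num = den = 0.0
    for e_label, e_set in ERROR_SETS.items():
        for b_label, b_set in BUFLEV_SETS.items():
            strength = min(trapezoid(error, *e_set), trapezoid(buflev, *b_set))
            num += strength * ALPHA[rule_output(e_label, b_label)]
            den += strength
    return num / den


print(fuzzy_alpha(-0.3, 0.1))  # falling quality while the receiver buffer is low
```

Under these assumed sets the memberships of each input always sum to one, so at least one rule fires and the division is safe; a different choice of breakpoints would need the same check.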
By performing a small number of invocations of the ANN, the rate controller performs the simple task of finding i ∈ {0, ..., N − 1} such that Q_{R_i} ≤ Q_target ≤ Q_{R_{i+1}}. The overhead of this process is insignificant. Assuming that Q_R is an increasing function of R, R_enc is then found by interpolating between R_i and R_{i+1}. To avoid R_enc getting much higher than R_tcpf, causing the receiver buffer to drain quickly, we allow it to increase relative to the instantaneous receiver buffer occupancy, i.e., up to ratio = 1 + buflev times R_tcpf during the same period (so ratio ∈ [1, 2]).
We investigated the ability of the fuzzy controller to (i) provide a smooth encoding quality and (ii) avoid starvation of the sender and receiver buffers. In a simulated transmission scenario using ns-2 [16], 8000 video frames (≈ 260 sec) from an action movie (The Matrix) were transmitted using a TCP-friendly flow (TFRC) [4]. The test video sequence contained scenes with various levels of content activity. The simulation topology was a typical dumbbell network with the bottleneck bandwidth set to 10 Mbps and the delay to 20 ms. To create a realistic variation of bandwidth, a number of background ON/OFF CBR flows, with ON and OFF times drawn from a Pareto distribution [23], also traversed the bottleneck link. The mean