Mad HVEI 2009
My thesis topic, presented at Human Vision and Electronic Imaging. It also appears in the Journal of Electronic Imaging, 2010.

  • Good afternoon, my name is Eric Larson, and today I will be presenting research from some work that we have done on finding a new algorithm for quality assessment. I researched this jointly with Dr. Damon Chandler and Cuong Vu at the Image Coding and Analysis Lab at Oklahoma State University. In this presentation, I will be discussing a new metric of visual fidelity that we have been working on in the lab. This work is meant to present a new algorithm, but also, in many ways, to bridge the gap between two camps of researchers in the field of image quality.
  • For example, imagine trying to rank these images by quality, where the original is at the top. You might rank them something like this. (Click)
  • There are many algorithms today that attempt to predict this type of ranking. What sets this work apart is that we argue that the strategy of quality assessment changes depending on the intensity and amount of distortion in an image. So we can sort images into two groups: (Click) high quality and low quality. High-quality images must be processed differently than low-quality images. For images containing near-threshold distortions, the image is most apparent, and thus the HVS attempts to look past the image and look for the distortions (a detection-based strategy). For images containing clearly visible distortions, the distortions are most apparent, and thus the HVS attempts to look past the distortions and look for the image's subject matter (an appearance-based strategy). In this work, we adaptively change our strategy based upon whether the image is of high quality, low quality, or somewhere in between, depending on how apparent the distortions are to the observer; thus the name of our metric, …
  • Most apparent distortion or MAD.
  • So that is the ultimate motivation of this research. The outline of the talk is as follows: First I will give a brief introduction to our approach and how it relates to current quality indices out there today. I will then present some motivating examples that convey why there exist two different strategies. (Click) I will then present the methods we used for incorporating those strategies into MAD. (Click) And show the comparative results of MAD on two databases of image fidelity. (Click) Finally, I will conclude and summarize the talk.
  • Image quality algorithms can be fit into one of three categories. There are those based upon mathematical convenience, like MSE and PSNR. These methods are fast and easy to embed. (Click) There are also algorithms based upon low level properties of human vision. Some notable indices are Daly’s VDP and Chandler’s VSNR. These metrics are based on our ability to detect distortions. (Click)
  • There are also metrics that base measurements upon overarching principles of human vision, like structure and information content. Some notable metrics are Dr. Zhou Wang's structural similarity, or SSIM, and Dr. Hamid Sheikh's visual information fidelity, or VIF, which attempts to capture mutual information between natural scenes and images. One thing to notice is that I have placed VSNR into both categories, because it is a hybrid metric. It assesses quality based upon a theory of global precedence, which is more like an overarching principle of vision, but also uses low-level vision to determine visibility and perceived contrast.
  • What we want to stress here is that each of these indices attempts to account for the single most relevant factor when assessing quality. We argue that the single most relevant factor changes depending upon the degree of distortion in the image.
  • It is interesting to look at how each index performs on subsets of images from the LIVE database. If we define high quality images as images which have a differential mean opinion score of less than 20, then we get the following predictive performance from the algorithms. (Click) This bar graph contains two measures of predictive performance for PSNR, VSNR, SSIM, and VIF. The dark bars represent the CC between DMOS and the predicted quality of the image for each algorithm, higher being better. The lighter bars represent the outlier ratio, which you can think of as a percentage of incorrect predictions or predictions that fall outside a tolerance. Of note here is that the best performing algorithm at high quality is VSNR, then VIF, SSIM, and PSNR. This is true looking at CC or outlier ratio – same trend. The conclusion to draw from this analysis is that VSNR models something better at high quality. We can conjecture the possibilities of why, but, really, we do not know. Let’s just keep in mind the ordering.
  • The next comparison we want to look at is low quality images. The graph shows the same information as the previous graph, except we are now looking at correlation and outlier ratio between the four algorithms and DMOS ratings that were above 60 – that is to say, these are considered highly distorted images. The trend changes dramatically. VIF is by far the best predictor. It is followed by VSNR and SSIM, which are about the same in performance, depending on which measure is being used to evaluate. And then PSNR follows as the worst prediction measure. It is of note that distortion energy is not nearly as important at low quality as it is at high quality. So looking at these trends, the obvious question is why does changing the subset based on subjective quality affect performance? We have argued that it is because humans use a different strategy based on whether the image or the distortion is more apparent. So maybe we can get an idea about the differences in performance from an example.
  • At high quality we are interested in two things: is the distortion visible, and how intense does it look? Imagine for a moment trying to rate a set of high quality images. What strategy might you use to rate and order them?
  • Here there is a reference image shown and two distorted versions. The middle image has some mild JPEG artifacts and the last is distorted by white noise. I have cropped the images and increased the contrast to make the distortions more visible. The average observer rates these images in the order they appear.
  • I would generally agree with the average observer. But what strategy was used to make that assessment? When given the task to discriminate these images, we go looking for the distortions and make judgments based on how well we can see them and how many there are. In addition, it is important to note what we are not doing here. The image content is not lost; I can still see that this is a biker. What I mean by that is that the distortions do not appear as losses of edges and texture in the image. Both distorted images look like the original with some type of mild distorting screen in front of them. This is different from highly distorted images.
  • That is especially true when predicting a very high quality image like this one. Now the artifacts in this image are not visually detectable, but the MSE picks up many distortions as …
  • Difference image shows. This shows that it will be highly important to know when distortions become detectable. MSE completely fails for this image. Okay so we now have a possible strategy for high quality. What about low quality?
  • Instead of detecting distortions, we want to know how much the distorted version looks like the original. One way to think about this is that at low quality, distortions do not appear like a screen anymore. They appear as if to destroy the content of the image. For example let’s look at a set of highly distorted bikers.
  • Like these. (BL) Here we have white noise, (BR) blurring, and, in these two, massive amounts of JPEG2000 artifacts. It is easy to see what I mean by the content of these images being destroyed. The average observer rates these images like this…
  • I would agree with the average observer again. But how did we perform the assessment? Let me ask that differently: what did we not do? When you assess the quality of the (BR) white noise as less than the (TL) blurring, is it not because the grass masks the blurring better? The distortions are so apparent that we are more interested in how edges and textures are preserved. These two examples illustrate how different strategies are used in assessing quality. We need to develop a way to approximate each strategy used.
  • For high quality images, we simulate the distortion detection task using masking. Then we incorporate this with the local mean squared error to model how the distortions degrade quality. (Click) For low quality, we capture appearance using a log-Gabor filter bank. Namely, we use the change in statistics of each sub-band in the log-Gabor filter bank. This is a strategy that has long been used for modeling texture appearance and categorization in computer vision. Lastly, we adaptively combine these two strategies into one index. If the image is high quality, we take our value from the top approximation. If the image is low quality, we use the lower approximation. For images in between, we use both strategies.
  • For our detection strategy, we first convert to linear luminance using the standard gamma compensation curve. So L is the luminance of the image. Once we have approximate luminance, we then take the cube root of the luminance to convert to L* space.
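As a rough sketch of this luminance step (the gamma value and normalization here are illustrative, not necessarily the exact constants MAD uses):

```python
import numpy as np

def to_lightness(pixels, gamma=2.2):
    # Hypothetical sketch: approximate display gamma compensation, then
    # cube-root compression as a stand-in for the L* lightness conversion.
    L = (pixels.astype(np.float64) / 255.0) ** gamma  # relative luminance
    return np.cbrt(L)                                 # perceived lightness

img = np.array([[0, 128, 255]], dtype=np.uint8)
print(to_lightness(img))
```

The cube root roughly linearizes the luminance values with respect to perceived brightness, which is the point of working in L* space.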
  • Next we filter the reference and distorted images by the 2D CSF. We perform this operation in the Fourier domain and convert back to the spatial domain. L star becomes I prime. At this point we have images that are roughly linearly related to our perception of brightness and contrast. Next we need to account for the masking of distortions.
  • A common model of masking is the local RMS contrast of the reference image and error image. For each patch, p in the reference image, we calculate the local standard deviation. Each patch is a 16x16 block. If the reference image is a busy texture then it can hide distortions well. The standard deviation is a good model of this, with one caveat: edges also have large standard deviations but do not mask well. If we just use the std, our model will think that a block with a hard edge can mask considerably well. To combat this we have devised a modified measure of standard deviation.
  • We set sigma ref to the minimum std of four sub-windows inside patch p. The beauty of this is that when we encounter an edge (Click), it likely does not intersect all four sub-windows, so we know that the edge cannot mask. (Click) But when we get to a busy patch, assuming stationarity, the std of each sub-window will be about the same as that of the block as a whole, and we get a measure of masking potential.
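The modified masking measure can be sketched like this (the 16x16 patch and four-quadrant split follow the description above; the rest is a minimal illustration):

```python
import numpy as np

def modified_std(patch):
    # Masking estimate for a 16x16 patch: the minimum std over the four
    # 8x8 sub-windows, so a single hard edge (which inflates the
    # full-patch std but masks poorly) does not count as masking.
    h, w = patch.shape
    subs = [patch[:h//2, :w//2], patch[:h//2, w//2:],
            patch[h//2:, :w//2], patch[h//2:, w//2:]]
    return min(s.std() for s in subs)

# A patch with a vertical edge: left half 0, right half 100.
edge = np.zeros((16, 16)); edge[:, 8:] = 100.0
print(edge.std())           # the full-patch std is large...
print(modified_std(edge))   # ...but no single sub-window crosses the edge
```

Here the plain std would report the edge patch as a strong masker, while the sub-window minimum correctly reports no masking potential.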
  • We then find the RMS contrast of the distortions. Strong distortions will have a large std. Notice that this has a luminance threshold. This is because very dark areas in an image do not show distortion well.
  • We define visibility as whether the contrast of the distortion is greater than ¾ of the modified contrast of the reference image. This threshold is set experimentally by viewing the mask and the distorted image. Also notice that doing this negates the need to normalize by the brightness of the patch. We are keeping it in this format because we hope to use the difference between the two contrasts as a measure of absolute visibility, but preliminary results show that a hard threshold works best when combining with the next part of our model.
  • We use the local MSE of each patch p to get a map of distortion intensity in the image. And the rest is a pretty standard definition of MSE. So this results in an LMSE map of the distortions in the image.
  • We then combine the LMSE map and Visibility map by point-wise multiplication, and collapse them into a single number using the two norm for our error pooling. So this is the index for high quality images.
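A minimal sketch of this combination step, assuming simple per-patch maps (the exact normalization in MAD may differ):

```python
import numpy as np

def d_detect(vis_map, lmse_map):
    # High-quality (detection) index: point-wise product of the
    # visibility map and the local-MSE map, collapsed with 2-norm
    # error pooling. Sketch only; MAD's normalization may differ.
    combined = vis_map * lmse_map
    return np.sqrt(np.mean(combined ** 2))

vis = np.array([[1.0, 0.0], [1.0, 0.0]])   # distortions visible only at left
lmse = np.array([[4.0, 9.0], [4.0, 9.0]])  # error energy everywhere
print(d_detect(vis, lmse))  # only the visible distortions contribute
```

The point-wise product is what makes this a detection model: error energy in patches where the distortion is masked simply does not count.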
  • For low quality we have argued that appearance is the key to quality. We model this using statistics of the log-Gabor filter bank.
  • Like this, G s o is the filter at scale s and orientation o.
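A sketch of the radial (scale) component of such a filter, built directly in the frequency domain; orientation tuning and MAD's exact scale spacing are omitted, and `sigma_ratio` is an assumed bandwidth parameter:

```python
import numpy as np

def log_gabor_radial(shape, f0, sigma_ratio=0.55):
    # Standard log-Gabor radial response:
    #   G(f) = exp( -ln(f/f0)^2 / (2 * ln(sigma_ratio)^2) )
    # peaking at the center frequency f0.
    rows, cols = shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    f = np.hypot(fx, fy)
    f[0, 0] = 1.0  # placeholder so log() is defined at DC
    G = np.exp(-np.log(f / f0) ** 2 / (2.0 * np.log(sigma_ratio) ** 2))
    G[0, 0] = 0.0  # log-Gabor filters have zero DC response
    return G

G = log_gabor_radial((64, 64), f0=0.25)
print(G[0, 16])  # response peaks where f equals f0
```

Working on a log frequency axis is what lets the bank tile scales evenly in octaves, and the zero DC response means each sub-band measures contrast, not mean luminance.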
  • We gather the appearance from the filter bank outputs. We use the statistics of each patch. This is motivated by our past research and the research of others. We showed that statistics of log-Gabors on animal camouflage patterns can provide insight into the mammalian visual system and explain why some patterns appear more camouflaged than others. In particular, Fred Kingdom and others have shown that individually changing the variance, skewness, and kurtosis in synthesized texture patterns is easily detectable by the human eye, and that perceptually the three statistics are good indicators for discerning visual appearance.
  • For each patch p in the image, we collect measures of standard deviation, skewness, and kurtosis at each scale and orientation of the filter bank for the reference and distorted image. (Go through equation). And we sum the differences of each patch into a single map, eta. We give coarser scales more weight because they are more perceptually annoying. The weights are learned from the A57 database at Cornell.
  • We collapse the statistical difference map, eta, into a single number using the vector two norm for our error pooling.
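The per-sub-band statistical difference might be sketched as follows; the patch-wise map, the summation over scales and orientations, and the learned scale weights are omitted, and the equal weighting of the three statistics here is an assumption:

```python
import numpy as np

def moments(band):
    # Std, skewness, and kurtosis of one sub-band's coefficients.
    x = band.ravel().astype(np.float64)
    sd = x.std()
    z = (x - x.mean()) / sd
    return sd, (z ** 3).mean(), (z ** 4).mean()

def stat_diff(ref_band, dst_band, weight=1.0):
    # Appearance term for one log-Gabor sub-band: weighted absolute
    # differences of the three statistics between reference and
    # distorted coefficients. In MAD this runs per patch and the
    # coarser scales receive larger weights.
    return weight * sum(abs(r - d)
                        for r, d in zip(moments(ref_band), moments(dst_band)))

rng = np.random.default_rng(0)
ref = rng.normal(size=(32, 32))
print(stat_diff(ref, ref))        # identical bands differ by zero
print(stat_diff(ref, ref ** 2))   # a distorted band shifts the statistics
```
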
  • So now we have a low and a high quality prediction of visual fidelity. How do we combine them into a single metric?
  • At this point we have two indices for assessing fidelity of very high quality images and very low quality images. What is going on for images in between? In this image for example, the bridge has statistical appearance changes (this wall is completely changed), but the trees and sky are largely masking the distortions. And this part of the bridge looks normal… We capture this interaction by adaptively blending the output of the two metrics.
  • We take a geometric mean of the two outputs weighted according to the output of PD high. Alpha varies from 0 to 1 depending on PD high. When PD high is small, alpha has a value of one. As PD high increases, alpha gradually approaches zero; that is where we want more of the low-quality metric to take over the output. Beta 1 and beta 2 are learned from images in the A57 database at Cornell. So this is MAD, the most apparent distortion. Low values indicate that the image is of high quality. A value of 0 indicates perfect quality.
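A sketch of this blending rule; the functional form of alpha and the beta values shown are illustrative placeholders, not necessarily the published constants:

```python
def mad_index(d_detect, d_appear, beta1=0.467, beta2=0.130):
    # Weighted geometric mean of the two strategies. Small d_detect
    # (high quality) drives alpha toward 1 so the detection term
    # dominates; large d_detect lets alpha fall toward 0 so the
    # appearance term takes over. The beta values here are
    # placeholders; the actual constants were learned from A57.
    alpha = 1.0 / (1.0 + beta1 * d_detect ** beta2)
    return d_detect ** alpha * d_appear ** (1.0 - alpha)

print(mad_index(0.0, 5.0))  # a perfect image scores 0
```

The geometric mean is a natural choice here: it degrades gracefully as the blend shifts, and it guarantees a score of zero whenever the detection term reports no visible distortion at all.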
  • The first database we chose to test the performance of MAD on was the LIVE database from UT Austin. This database contains 779 distorted versions of 29 original images. The study used 29 observers, resulting in over 20000 ratings of visual fidelity. The ratings are in terms of differential mean opinion score, or DMOS. Lower DMOS denotes higher quality. LIVE has reported their average DMOS rating and the standard deviation of DMOS ratings per image. The next database we tested MAD on is a preliminary version of a database of our own creation. Eventually, we plan to have many ratings and images of different types, fit into categories; thus the name categorical subjective image quality, or CSIQ. At this point, however, we only have ten observers, 10 original images, and ratings for 300 distorted versions of the images. We are working to expand the database, but we wanted to show the performance of the various metrics on CSIQ. The database contains six distortion types:
  • Here is the summary of the results on LIVE. This table shows the three measures of performance CC, SROCC, and the outlier ratio. From this summary, notice that VSNR and SSIM perform about the same overall, with SSIM winning marginally in all but outlier ratio. Seeing the outlier ratio and distance side by side here shows us that when VSNR is wrong, it is really wrong.
  • Because VIF is the next best performing metric we will take this comparison a little further.
  • These two graphs show the predicted quality of MAD and VIF. The vertical axis denotes the DMOS rating and the horizontal axis denotes the predicted quality, after logistic transformation. Notice that MAD is a noticeably tighter fit to the data than VIF.
  • For the most part MAD has Gaussian residuals except for one place…
  • Here there are several images where MAD predicts the images are of low quality, while they are actually high quality images. Failure cases can be important clues to the improvement of a predictor, and these images point out a failure of our masking model. It does not correctly classify the image as being high quality, and the appearance strategy is used inappropriately. Outliers like these are detrimental to the CC.
  • This table shows statistical significance based on the F test. A zero signifies that the distributions are the same with 99% confidence. A one denotes that the index in the row is statistically better than the index in the column, and a negative one denotes that the row is worse than the column. Notice that MAD is statistically better than all the other metrics. VIF is better than everything except MAD.
  • This table summarizes the three performance measures used and their values for each of the quality predictions on all images in the CSIQ database. MAD clearly performs better than all other metrics, in all categories. The CSIQ database includes contrast and pink noise distortions, which none of the algorithms were trained or evaluated on. This comparison shows the generalization of each index and how powerful MAD could be for a wide range of applications.
  • Notice that there are no real visible outliers, so on CSIQ, MAD is homoskedastic. There is still room for improvement, but nothing appears to be glaringly wrong with the analysis.
  • The stat table is shown. Not much new here. MAD is statistically significant over all other metrics. On the types of distortions and images in CSIQ, SSIM and VIF are statistically no different.
  • In conclusion, (read out) We are currently working on a better masking model and an implementation of MAD using only 9/7 wavelet filters.
  • Are there any questions?
  • Here is a quick look at the references for assertions and images in the presentation that are not our own; the complete list can be found in the thesis.
  • The pink noise shows up as a color-type distortion. Then we have blurring, less intense pink noise, contrast loss, JPEG artifacts, and JPEG2000 artifacts. Take these results with a grain of salt; we do not have quite enough observers to average out the std between observers. However, for most images the error bars are quite small.
  • As seen in this example, the window is slid across the image and the SSIM index is saved into a map of the image. In this example, black denotes more distortion, and the map shows portions of the image that are most distracting. It is a fairly good measure of change in structure, as you can see. The most annoying blurring artifacts appear here, which the map picks up quite well. Once this is completed, …
  • The final metric is then the division of the information in the sub-bands of the distorted image over the information in the sub-bands of the reference image. This operation makes sense: what information we get from the distorted image, over the information that we could have gotten from the reference. Of course you are likely wondering how they model the mutual information of the natural images and reference image. Keep in mind that a complete understanding of VIF is not necessary, but it is interesting to look at the operations. First the information in the reference. They have decomposed the reference into the wavelet domain. Then they can calculate the projection onto the natural image covariance basis (now there is no such thing as the natural image covariance basis; if natural images were a linear sum of elements, analysis would be easy, so instead they approximate it using a GSM of wavelet coefficients that are common to natural images). There are some scaling constants for each sub-band, the s sub js, and there is a visual noise variance parameter sigma n. Sigma n is their model of noise induced by the HVS; our eye and nervous system are summed up in a single variance. You can see why Camp One might not be happy with this method. So the information content of image one is calculated. In the same respect, the distorted image is modeled by this, where the new additions are the distortion variance sigma v and the gain control parameter g sub j. All this is really saying is that g sub j is the multiplicative distortion (like blurring) and sigma v is the additive noise variance (like white noise in the wavelet domain). These two equations are similar to the equation for mutual information. A direct understanding of VIF is not necessary (the metric uses approximations of entropy, and the entropy of natural scenes is entirely unknown).
Just know that, as of now, VIF is the best performing metric of visual fidelity and is the metric that everyone is trying to best. Now, let's talk about a hybrid metric.

I(X;Y) = Σ_{x,y} p(x,y) log[ p(x,y) / ( p(x) p(y) ) ]
I(X;Y) = H(X) + H(Y) − H(X,Y) = H(X) − H(X|Y)
H(X) = −Σ_x p(x) log p(x)
  • Here is a reference image and four distorted versions. The projector is not the ideal source, but bear with me for a moment. (BL) This image has some additive white noise distortions. (TL) This has some JPEG DCT artifacts around the edges of the light house. (TR) This has some JPEG2000 wavelet blurring and ringing. (BR) And this image has the same distortions as this one, only less intense. What rating would you possibly give these images? Let's see how the average observer rates these…
  • I generally agree with the average observer. But what strategy did we use to come up with this rating of quality? Well, we went looking for distortions, and then asked how well we could see the distortions. Obviously at high quality, masking and intensity are extremely important to this task. Are we interested in how the structure of the image is distorted? Maybe a little bit in these two images, but in these, quality is a distortion-detection task, not structure. So what about at low quality?
  • The original image is shown at the top. This is a low-frequency distorted image. This is also a low-frequency distorted image, but the frequencies are slightly higher. The distortions are the exact same intensity, but which one is most appearance-changing? Clearly this image differs more in appearance from the original than this one. So there is a precedent for weighting low-frequency statistical changes more than higher-frequency ones. The exact weights that we chose are the best performing integer weights for low quality images. We could have fine-tuned them for ultimate performance, but there is no real biological precedent for weighting statistics, and we did not want to over-optimize the algorithm, so we adjusted these as integers.
  • The first database we chose to test the performance of MAD on was the LIVE database from UT Austin. This database contains 779 distorted versions of 29 original images. The study used 29 observers, resulting in over 20000 ratings of visual fidelity. The style of experiment was one up random, where the original and distorted image appeared at the same time and the observer provides a rating. The ratings of fidelity were converted to z scores for each observer and converted further into ratings of differential mean opinion score, or DMOS. Low DMOS values denote higher quality. A realignment of the scores was also completed so that across images the DMOS ratings were accurate. LIVE has reported their average DMOS rating and the standard deviation of DMOS per image. The distortions present in the database are compression artifacts like JPEG and JPEG2000, and photographic artifacts like additive white Gaussian noise and Gaussian blurring. In addition, there were images with simulated packet losses of a JPEG2000 compressed image stream. This resulted in images with JPEG2000 artifacts and some ghosting in the reconstruction. Some representative stimuli …
  • Here there are several images, five to be exact, where MAD predicts the images are of low quality, while they are actually high quality images. Failure cases can be important clues to the improvement of a metric, but for now let’s see how much these outliers are bringing down MAD.
  • These graphs again show DMOS plotted versus MAD and VIF. Except the five worst predictions from each metric have been removed. VIF, having no real outliers, only improves marginally. However, MAD improves quite a bit in terms of correlation. We are currently working on improving MAD to remove these outliers. Keep them in mind as they will come into play again.
  • The past comparisons show how well the metrics perform, and it would seem that MAD is better than the others. However, when we talk about being better, we need to do so in terms of statistical significance. Because we are talking about predicting the DMOS dataset, we can talk about this in terms of regression, and we can use the mathematical foundations of regression to judge performance. So how do we say that one regression is statistically better than another regression fit? To do so, we talk about the distribution of the residuals, or the leftover distance between the actual data and the line of best fit. Ideally the residuals would have no variance and the mean would be zero; that is to say, they are exactly on the line of best fit. So if one fit has a smaller variance, then it is possible that it is better. If the distribution of the residuals is Gaussian, then we can say more about the residuals from their variance difference. Let's look at it this way…
  • Let's say we have four sets of residuals from four different fits, and that they follow Gaussian distributions. The first, shown in green, is always lower than the others, meaning it always over-estimates the DMOS rating. The other three all represent the metric reasonably well, and have the same mean, zero. Obviously, the tallest peak is the best because it has the smallest variance: it is always close to the DMOS. But what happens when the distribution is not continuous? There is still a probability that we could draw a blue distribution from the one in red. That is to say, we could draw 10 samples from the red distribution and it would look like it came from the blue. How could we say that we are drawing from the better distribution and not the red one? Well, we cannot with absolute certainty. But we can say that one is drawn from a different distribution with a probability. It turns out that, given the number of samples drawn and the two variances, the probability that they are drawn from the same Gaussian distribution follows an incomplete Beta function, and this is called the F test of significance. Now, I am not a statistician, so I cannot tell you exactly why it follows the incomplete Beta function, but I am an engineer, so I can tell you how we can use it. If two sets of residuals are Gaussian, one has a smaller variance, and the Beta distribution says there is a 99% probability that they are from different distributions, then the one with the smaller variance is statistically better. So that is all: we show that they are different distributions with 99% confidence and that one has a smaller variance, right? Wrong. We now have to show that the residuals are sufficiently Gaussian. I am not going to go in depth here, because if the residuals are not Gaussian there is no good way of establishing statistical significance, so we assume that they are Gaussian. There is a measure of Gaussianity called the JB statistic. The lower the value, the more Gaussian the distribution is. It is largely based upon the skewness and kurtosis of the distribution in question. We will report on it; just know that low values of the JB statistic denote Gaussianity.
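A sketch of the JB statistic computation, using the standard Jarque–Bera definition (assumed here to match the statistic used in the analysis):

```python
import numpy as np

def jarque_bera(residuals):
    # Jarque-Bera statistic: n/6 * (S^2 + K^2/4), where S is the sample
    # skewness and K the excess kurtosis. Low values indicate the
    # residuals are close to Gaussian.
    x = np.asarray(residuals, dtype=np.float64)
    n = x.size
    z = (x - x.mean()) / x.std()
    S = (z ** 3).mean()
    K = (z ** 4).mean() - 3.0
    return n / 6.0 * (S ** 2 + K ** 2 / 4.0)

rng = np.random.default_rng(1)
gauss = rng.normal(size=5000)
heavy = np.concatenate([gauss, [25.0, -30.0, 40.0]])  # a few gross outliers
print(jarque_bera(gauss) < jarque_bera(heavy))  # outliers raise JB sharply
```

This is exactly the behavior discussed later: a handful of outliers can make an otherwise Gaussian residual set look very non-Gaussian under JB.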
  • Here are two tables. The first shows statistical significance based on the F test. A zero signifies that the distributions are the same with 99% confidence. A one denotes that the metric in the row is statistically better than the metric in the column, and a negative one denotes that the row is worse than the column. The second table shows some measures of Gaussianity. Notice that MAD is statistically better than all the other metrics. VIF is better than everything except MAD. However, also notice that MAD's residuals are the least Gaussian; its JB statistic is 100 times higher than some of the others. At first glance this seems to discredit the significance test. But let's look a little deeper into what is going on.
  • This is the distribution of the MAD residuals. It appears Gaussian except for some outliers, the same ones that were detrimental to the correlation coefficient (CC).
  • And there they are. But this is actually not a problem.
  • Without them, the JB statistic says MAD is the most Gaussian metric of all. And the fact that MAD's residuals fail the Gaussianity test because of outliers means the reported variance is even higher than it would be without them. In other words, the outliers make it harder for MAD to reach statistical significance, yet it still does. So even though the outliers need to be addressed with a better masking model, their presence only bolsters MAD's significant performance.
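The effect described here, a few outliers inflating both the JB statistic and the variance, is easy to demonstrate on synthetic residuals. The numbers below are made up for illustration and are not MAD's actual residuals:

```python
import numpy as np

# Illustrative only: a mostly-Gaussian residual set plus three outliers,
# standing in for the shape of MAD's residual distribution.
rng = np.random.default_rng(1)
res = np.concatenate([rng.normal(0, 5, 760), np.array([40.0, -45.0, 50.0])])

def jb(r):
    """Jarque-Bera statistic from skewness and kurtosis."""
    m = r - r.mean()
    s2 = np.mean(m**2)
    S = np.mean(m**3) / s2**1.5
    K = np.mean(m**4) / s2**2
    return len(r) / 6.0 * (S**2 + (K - 3.0)**2 / 4.0)

trimmed = res[np.abs(res) < 30]   # drop the three outliers
print(jb(res), jb(trimmed))       # JB drops sharply without them
print(np.var(res), np.var(trimmed))  # and so does the variance
```

So removing the outliers makes the sample both more Gaussian and lower-variance, which is why their presence only makes the significance result more conservative.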
  • The CSIQ database procedure was a table-top approach. The reference image and all of its distorted versions are laid out on a table, and the observer places each distorted image at a distance from the original according to its perceived quality; observers are asked to make the placement linear with quality. We made this process electronic so that we could control viewing distance and quickly collect the quality results for each image. The "table top" is actually a four-monitor array.
  • The setup looks something like this. As you can see, viewers can place images, readjust them relative to other images, scroll up and down, and so on. This greatly reduces the noise that comes from having to remember earlier quality judgments: the observer can fine-tune all of them. What you see here is actually the realignment portion of the experiment. Once the observer has assessed all the images, he or she views a subset of distorted images from each original, placed according to his or her own ratings, and adjusts them so that the ratings are consistent across image types. We then converted each observer's ratings to z-scores and rescaled them to span 0 to 100. One important difference between CSIQ and LIVE is that CSIQ contains many high-quality images; one complaint about LIVE is that many of its images are of lower quality. So here is how MAD compares to the others on CSIQ.
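The rating-normalization step just described (per-observer z-scores, then rescaling to 0-100) can be sketched as below. `normalize_ratings` is a hypothetical helper written for illustration, not the lab's actual processing code:

```python
import numpy as np

def normalize_ratings(ratings_per_observer):
    """Convert each observer's raw placements to z-scores, average across
    observers, then rescale the pooled scores to span 0-100.
    A sketch of the stated CSIQ procedure, assuming every observer
    rated the same images in the same order."""
    z = []
    for r in ratings_per_observer:
        r = np.asarray(r, dtype=float)
        z.append((r - r.mean()) / r.std(ddof=1))  # per-observer z-scores
    pooled = np.mean(z, axis=0)                   # mean z-score per image
    lo, hi = pooled.min(), pooled.max()
    return 100.0 * (pooled - lo) / (hi - lo)      # stretch to [0, 100]
```

For example, `normalize_ratings([[1, 2, 3, 4], [2, 3, 4, 8]])` maps the lowest-rated image to 0 and the highest to 100 while preserving rank order.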
  • The significance table and the measures of Gaussianity are shown. Not much is new here: MAD is statistically significant over all other metrics, and every metric's residuals are Gaussian or close to it, except VSNR's. So why is VSNR performing so inconsistently? It turns out that if we look at the scatter plot for VSNR, it has a considerable number of outliers, all belonging to the contrast distortions. In light of this, we decided to see how every metric performs when contrast is removed. Since a global contrast change can be easily corrected, it is not unreasonable to examine the results without it.
  • Now we get a more complete picture of what is going on. Here are the five performance measures for CSIQ without the contrast-distorted images. VSNR performs quite well: not as well as MAD, but much better than before. MAD's performance also improved marginally, to roughly its level on LIVE. PSNR does not do well. SSIM went up and is on par with VIF, or slightly better. It is interesting that VSNR does so well on CSIQ but not on LIVE; perhaps the large proportion of high-quality images helps VSNR, since it uses properties of low-level vision.
  • Transcript

    • 1. Most Apparent Distortion: A dual strategy for full-reference image quality assessment. Eric Larson, Cuong Vu, Damon Chandler. Image Coding and Analysis Lab (ICAN)
    • 2. Motivating example…
    • 3. Motivating example…
    • 4. Motivating example… Most Apparent Distortion MAD
    • 5. Outline
      • Introduction/Motivation
        • Current Methods
        • An example of two strategies
      • Methodology of MAD
        • Detection modeling
        • Appearance modeling
        • Adaptation
      • Results
      • Conclusion
      Motivation Methods Results
    • 6. Three Types of Indices
      • Mathematical efficiency
        • Peak Signal-to-Noise Ratio ( PSNR ), Mean-Squared Error ( MSE )
      • Low level properties of Human Visual System ( HVS )
        • Visual Difference Predictor ( VDP ) [Daly, 1992], Perceptual Structure [Carnec et al., 2003], Visual Signal-to-Noise Ratio ( VSNR ) [Chandler, Hemami, 2007], Wavelet-based quality assessment ( WQA ) [Ninassi et al., 2008]
      Motivation Methods Results
    • 7. Three Types of Indices
      • Mathematical efficiency
        • Peak Signal-to-Noise Ratio ( PSNR ), Mean-Squared Error ( MSE )
      • Low level properties of Human Visual System ( HVS )
        • Visual Difference Predictor ( VDP ) [Daly, 1992], Perceptual Structure [Carnec et al., 2003], Visual Signal-to-Noise Ratio ( VSNR ) [Chandler, Hemami, 2007], Wavelet-based quality assessment ( WQA ) [Ninassi et al., 2008]
      • Overarching principles of human vision
        • Structural SIMilarity ( SSIM ) [ Wang,2004] , Visual Information Fidelity ( VIF ) [Sheikh,2006] , VSNR, Perceptual Structure
      Motivation Methods Results
    • 8. Three Types of Indices
      • Mathematical efficiency
        • Peak Signal-to-Noise Ratio ( PSNR ), Mean-Squared Error ( MSE )
      • Low level properties of Human Visual System ( HVS )
        • Visual Difference Predictor ( VDP ) [Daly, 1992], Perceptual Structure [Carnec et al., 2003], Visual Signal-to-Noise Ratio ( VSNR ) [Chandler, Hemami, 2007], Wavelet-based quality assessment ( WQA ) [Ninassi et al., 2008]
      • Overarching principles of human vision
        • Structural SIMilarity ( SSIM ) [ Wang,2004] , Visual Information Fidelity ( VIF ) [Sheikh,2006] , VSNR, Perceptual Structure
      • Single most relevant strategy is modeled
      Motivation Methods Results
    • 9. Motivation
      • How does each index perform on high quality images?
      Motivation Methods Results
    • 10. Motivation
      • Low Quality Images?
      Motivation Methods Results
    • 11. A Task for High Quality
      • Can we see the distortion?
      • How intense is the distortion?
      Motivation Methods Results
    • 12. A Task for High Quality
      • Can we see the distortion?
      • How intense is the distortion?
      Original JPEG Noise Motivation Methods Results
    • 13. A Task for High Quality
      • Can we see the distortion?
      • How intense is the distortion?
      17.7 23.5 0.0 Original JPEG Noise Motivation Methods Results
    • 14. A Task for High Quality
      • Can we see the distortion?
      • How intense is the distortion?
      [9] Motivation Methods Results
    • 15. A Task for High Quality
      • Can we see the distortion?
      • How intense is the distortion?
      [9] Motivation Methods Results
    • 16. A Task for Low Quality
      • How much does the image look like the original, given that there are so many visible distortions?
      Motivation Methods Results
    • 17. A Task for Low Quality
      • How much does the image look like the original, given that there are so many visible distortions?
      [9] Motivation Methods Results
    • 18. A Task for Low Quality
      • How much does the image look like the original, given that there are so many visible distortions?
      74.64 59.0 67.09 82.74 [9] Motivation Methods Results
    • 19. A Task for Low Quality
      • How much does the image look like the original, given that there are so many visible distortions?
      Motivation Methods Results
    • 20. Motivation Summary
      • Approximate high quality task:
        • Visibility
        • Intensity
      • Approximate low quality task:
        • Preserve content (appearance)
      • Adaptively change strategies
      Motivation Methods Results
    • 21. Methodology High Quality
    • 22. A Strategy for High Quality
      • Conversion to perceived brightness
        • Pixel to luminance
        • luminance to L*
      Motivation Methods Results
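The perceived-brightness conversion on this slide (pixel values to luminance, then luminance to L*) can be sketched as follows. The gamma value and peak luminance below are assumed display parameters for illustration, not values taken from the presentation:

```python
import numpy as np

def pixels_to_lstar(img, gamma=2.2, L_max=100.0):
    """Sketch of the perceived-brightness conversion:
    8-bit pixel values -> display luminance via a simple gamma model
    (gamma and L_max are assumed display parameters) -> CIE L*."""
    Y = L_max * (np.asarray(img, dtype=float) / 255.0) ** gamma  # cd/m^2
    t = Y / L_max                                   # normalize by white
    # CIE L* definition, including its linear segment near black
    f = np.where(t > (6 / 29) ** 3,
                 np.cbrt(t),
                 t / (3 * (6 / 29) ** 2) + 4 / 29)
    return 116.0 * f - 16.0
```

Black (pixel 0) maps to L* = 0 and the display's white point (pixel 255) to L* = 100, with the mapping monotonic in between.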
    • 23. A Strategy for High Quality
      • Conversion to perceived brightness
        • Pixel to luminance
        • luminance to L*
      • Contrast sensitivity
      Motivation Methods Results
    • 24. A Strategy for High Quality
      • Contrast and luminance masking
      Motivation Methods Results
    • 25. A Strategy for High Quality
      • Spatial frequency and luminance masking
      Motivation Methods Results p 11 p 22 p 12 p 21
    • 26. A Strategy for High Quality
      • Contrast and luminance masking
      Motivation Methods Results
    • 27. A Strategy for High Quality
      • Contrast and luminance masking
      Motivation Methods Results
    • 28. A Strategy for High Quality
      • Contrast and luminance masking
      Motivation Methods Results
    • 29. A Strategy for High Quality
      • Contrast and luminance masking
      Motivation Methods Results
    • 30. A Strategy for High Quality Visibility Map LMSE Map Combine Maps Collapse with two- norm Read Images Motivation Methods Results
    • 31. A Strategy for High Quality Visibility Map LMSE Map Combine Maps Collapse with two- norm Read Images Motivation Methods Results
    • 32. A Strategy for High Quality Visibility Map LMSE Map Combine Maps Collapse with two- norm Read Images Motivation Methods Results
    • 33. A Strategy for High Quality Visibility Map LMSE Map Combine Maps Collapse with two- norm Read Images Motivation Methods Results
    • 34. A Strategy for High Quality Visibility Map LMSE Map Combine Maps Collapse with two- norm Read Images PD high Motivation Methods Results
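The block diagram above (read images, visibility map, LMSE map, combine, collapse with the two-norm) can be sketched structurally as below. The visibility map, i.e., the CSF and masking computations of the preceding slides, is taken as a supplied input; this skeleton only shows how the maps are combined and collapsed, and the block size is an illustrative choice, so it is not MAD itself:

```python
import numpy as np

def detection_stage(ref, dst, visibility, block=16):
    """Skeleton of the high-quality (detection) stage: a local MSE map,
    weighted by a precomputed visibility map, collapsed with the 2-norm."""
    err = (np.asarray(ref, dtype=float) - np.asarray(dst, dtype=float)) ** 2
    h, w = err.shape
    vals = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            lmse = err[i:i + block, j:j + block].mean()      # local MSE
            vis = visibility[i:i + block, j:j + block].mean()
            vals.append(vis * lmse)                          # combine maps
    vals = np.array(vals)
    return np.sqrt(np.mean(vals ** 2))                       # 2-norm collapse
```

An identical pair of images yields 0, and larger visible errors raise the score, matching the role of PD_high in the diagram.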
    • 35. Methodology Low Quality
    • 36. A Strategy for Low Quality
      • Defining appearance:
        • Biological motivation: the log-Gabor filter bank [Field 1987, Kovesi]
      Motivation Methods Results
    • 37. A Strategy for Low Quality
      • Defining appearance:
        • Biological motivation: the log-Gabor filter bank [Field 1987, Kovesi]
        • Five Scales
        • Four Orientations
      Motivation Methods Results
    • 38. A Strategy for Low Quality
      • Gather appearance based upon statistics of G so
      • Statistics have been used to model
        • Animal Camouflage [Larson, Chandler 2007]
        • Texture Appearance [Kingdom, et al. 2003]
            • Variance, Skewness, and Kurtosis
      Motivation Methods Results
    • 39. A Strategy for Low Quality
      • Gather appearance based upon statistics of G so
      • where w s = [0.5, 0.75, 1, 5, 6]
      Motivation Methods Results
    • 40. A Strategy for Low Quality
      • Gather appearance based upon statistics of G so
      • where w s = [0.5, 0.75, 1, 5, 6]
      Motivation Methods Results
    • 41. A Strategy for Low Quality
      • Gather appearance based upon statistics of G mag,so
      • where w s = [1, 2, 6, 10, 12]
      Gabor Filtering Patch Statistics Collapse with two- norm Read Images Motivation Methods Results
    • 42. A Strategy for Low Quality
      • Gather appearance based upon statistics of G mag,so
      • where w s = [1, 2, 6, 10, 12]
      Gabor Filtering Patch Statistics Collapse with two- norm Read Images Motivation Methods Results
    • 43. A Strategy for Low Quality
      • Gather appearance based upon statistics of G mag,so
      • where w s = [1, 2, 6, 10, 12]
      Gabor Filtering Patch Statistics Collapse with two- norm Read Images Motivation Methods Results
    • 44. A Strategy for Low Quality
      • Gather appearance based upon statistics of G mag,so
      • where w s = [1, 2, 6, 10, 12]
      Gabor Filtering Patch Statistics Collapse with two- norm Read Images Motivation Methods Results
    • 45. A Strategy for Low Quality
      • Gather appearance based upon statistics of G mag,so
      • where w s = [1, 2, 6, 10, 12]
      Gabor Filtering Patch Statistics Collapse with two- norm Read Images PD low Motivation Methods Results
    • 46. What about in Transition?
      • In transition, we argue that both strategies are employed
      [12] Motivation Methods Results
    • 47. Adaptation
      • We can adaptively model their interaction based upon PD high
      • The final index is a weighted geometric mean
      Motivation Methods Results
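The weighted geometric mean on this slide can be written as a one-line combination rule. The form of the weight and the beta constants follow the published MAD paper, but treat the exact values here as illustrative:

```python
def mad_combine(d_detect, d_appear, beta1=0.467, beta2=0.130):
    """Weighted geometric mean of the two stages. The weight alpha leans
    toward the detection stage for high-quality images (small d_detect)
    and toward the appearance stage for low-quality ones."""
    alpha = 1.0 / (1.0 + beta1 * d_detect ** beta2)
    return (d_detect ** alpha) * (d_appear ** (1.0 - alpha))
```

Note that a distortion-free image (d_detect = 0) gives alpha = 1 and a final score of 0, and the combined score always lies between the two stage scores.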
    • 48. Results
    • 49. Results
      • LIVE [9] Quality Database:
        • 779 Distorted Images, 29 Original
        • 29 Observers
        • JPEG , JPEG2000 , Blurring, AWGN , and simulated packet loss
      • CSIQ ( Categorical Subjective Image Quality ) Database:
        • 10 original images, 300 distorted versions
        • 10 observers
        • Blur, contrast, AWGN , JPEG , JPEG2000 , APGN (1/f noise)
      Motivation Methods Results
    • 50. Results
      • LIVE Performance, All images
      Motivation Methods Results

      ALL     PSNR     SSIM     VSNR     VIF      MAD
      CC      0.8707   0.9378   0.9233   0.9595   0.9695
      SROCC   0.8763   0.9473   0.9278   0.9633   0.9703
      R out   68.16%   59.18%   58.79%   54.56%   42.40%
    • 51. Results
      • LIVE Performance, All images
      Motivation Methods Results

      ALL     PSNR     SSIM     VSNR     VIF      MAD
      CC      0.8707   0.9378   0.9233   0.9595   0.9695
      SROCC   0.8763   0.9473   0.9278   0.9633   0.9703
      R out   68.16%   59.18%   58.79%   54.56%   42.40%
    • 52. Results
      • LIVE Performance figures
      Logistic MAD Logistic VIF CC = 0.9695 CC = 0.9595 DMOS Motivation Methods Results
    • 53. Results
      • LIVE Performance figures
      Logistic MAD CC = 0.9695 DMOS Motivation Methods Results
    • 54. Results
      • LIVE Performance figures
      Logistic MAD CC = 0.9695 DMOS Motivation Methods Results
    • 55. Statistical Significance
      • LIVE Database, 99% confidence
      • 1 = better, 0 = same, -1 = worse
      Motivation Methods Results

             PSNR  SSIM  VSNR  VIF  MAD
      PSNR    0     -     -     -    -
      SSIM    1     0     -     -    -
      VSNR    1    -1     0     -    -
      VIF     1     1     1     0    -
      MAD     1     1     1     1    0
    • 56. Results
      • CSIQ Overall Performance
      Motivation Methods Results

      ALL     PSNR    SSIM    VSNR    VIF     MAD
      CC      0.8455  0.8893  0.8472  0.9079  0.9487
      SROCC   0.8428  0.9019  0.8577  0.9063  0.9469
      R out   35.6%   30.5%   28.2%   33.9%   23.5%
    • 57. Results
      • CSIQ Overall Performance
      Logistic MAD CC = 0.9487 DMOS Motivation Methods Results
    • 58. Statistical Significance
      • CSIQ Database, 99% Confidence
      • 1 = better, 0 = same, -1 = worse
      Motivation Methods Results

             PSNR  SSIM  VSNR  VIF  MAD
      PSNR    0     -     -     -    -
      SSIM    1     0     -     -    -
      VSNR    0    -1     0     -    -
      VIF     1     0     1     0    -
      MAD     1     1     1     1    0
    • 59. Conclusion
      • Quality prediction algorithms can enhance performance by adaptively changing strategy
      • MAD performs significantly better than any other existing index on two databases
      • MAD shows promise in generalizing to a range of distortions
      • Multiple strategies
      Motivation Methods Results
    • 60. Thank You
      • Questions?
      Motivation Methods Results
    • 61. Thank You Motivation Methods Results
    • 62. References
      • B. Girod, "What's wrong with the mean squared error?," pp. 207–240, MIT Press, 2nd ed., 1993.
      • T. Chen, Invited Lecture , Carnegie Mellon University, 2008 IEEE Southwest Symposium on Image Analysis and Interpretation.
      • S. Daly, "The visible differences predictor: an algorithm for the assessment of image fidelity," in Proc. SPIE, vol. 1666, Human Vision, Visual Processing, and Digital Display III, B. E. Rogowitz, ed., pp. 2–15, Aug. 1992.
      • D. Chandler and S. Hemami, "VSNR: a wavelet-based visual signal-to-noise ratio for natural images," IEEE Transactions on Image Processing, vol. 16, pp. 2284–2298, Sept. 2007.
      • Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “ Image quality assessment: from error visibility to structural similarity ,” IEEE Transactions on Image Processing, vol. 13, pp. 600–612, April 2004.
      • H. R. Sheikh and A. C. Bovik, “ Image information and visual quality, ” IEEE Transactions on Image Processing, vol. 15, pp. 430–444, Feb. 2006.
      • E. Peli, L. E. Arend, G. M. Young, and R. B. Goldstein, “ Contrast sensitivity to patch stimuli: Effects of spatial bandwidth and temporal presentation,” Spatial Vision, vol. 7, pp. 1–14, 1993.
      • G. E. Legge and J. M. Foley, “ Contrast masking in human vision ,” J. of Opt. Soc. Am., vol. 70, pp. 1458–1470, 1980.
      • H. R. Sheikh, Z. Wang, A. C. Bovik, and L. K. Cormack. Image and Video Quality Assessment Research at LIVE [Online]. Available: http://live.ece.utexas.edu/research/quality/ .
      • B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?," Vision Research, vol. 37, pp. 3311–3325, Dec. 1997.
      • P. D. Kovesi.   MATLAB and Octave Functions for Computer Vision and Image Processing. School of Computer Science & Software Engineering, The University of Western Australia.   Available from: http://www.csse.uwa.edu.au/~pk/research/matlabfns/
      • F. A. A. Kingdom, A. Hayes, and D. J. Field, "Sensitivity to contrast histogram differences in synthetic wavelet-textures," Vision Research, vol. 41, pp. 585–598, 2001.
      • N. P. S. D. I. A. [Online].
      • VQEG, "Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment, Phase II," Aug. 2003 [Online]. Available: http://www.vqeg.org .
      • H. Sheikh, M. Sabir, and A. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, pp. 1349–1364, Nov. 2006.
      • Correlation, Wikipedia, http://en.wikipedia.org/wiki/Correlation
      • Regression Analysis, Wikipedia, http://en.wikipedia.org/wiki/Regression_analysis
    • 63. Results
      • CSIQ ( Categorical Subjective Image Quality ) Database
      • Preliminary Numbers:
        • 4 observers
        • 10 original images
        • 300 distorted versions
        • Six distortion types:
          • Blur, contrast, AWGN , JPEG , JPEG2000 , APGN (1/f noise)
      Motivation Methods Results
    • 64. Camp Two: Structure
      • SSIM captures Gaussian windowed spatial statistics
      • Collapse quality by taking mean of map
      Motivation Methods Results
    • 65. Camp Two: Information
      • VIF models mutual information by
        • Analyzing in wavelet domain [6]
        • Applying reference and distorted images to HVS model
        • Where C_U is the principal image covariance of sub-band i ,
        • σ_v is the distortion noise variance, σ_n is the visual noise variance ,
        • s and g are sub-band scaling constants
      Motivation Methods Results
    • 66. A Task for High Quality
      • Can we see the distortion?
      • How intense is the distortion?
      [9] Motivation Methods Results
    • 67. A Task for High Quality
      • Can we see the distortion?
      • How intense is the distortion?
      20.52 8.06 13.03 20.30 [9] Motivation Methods Results
    • 68. A Strategy for Low Quality
      • Gather appearance based upon statistics of G mag,so
      • where w s = [1, 2, 6, 10, 12]
      Motivation Methods Results
    • 69. Performance Measures
      • LIVE [9] Quality Database:
        • 779 Distorted Images, 29 Original
        • 29 Observers
        • Over 20,000 ratings of image fidelity, in terms of Differential Mean Opinion Score ( DMOS )
        • Five categories of distortion:
          • JPEG , JPEG2000 , Blurring, AWGN , and simulated packet loss
      Motivation Methods Results
    • 70. Results
      • LIVE Performance figures
      Motivation Methods Results
    • 71. Results
      • LIVE Performance figures
      Motivation Methods Results
    • 72. Statistical Significance
      • In terms of Regression…
      [16] Motivation Methods Results
    • 73. Statistical Significance
      • Gaussian Residuals
      [16] Motivation Methods Results
    • 74. Statistical Significance
      • LIVE Database, 99% confidence
      • 1 = better, 0 = same, -1 = worse
      Motivation Methods Results

      ALL       PSNR     SSIM     VSNR     VIF      MAD
      PSNR       0       -1       -1       -1       -1
      SSIM       1        0        1       -1       -1
      VSNR       1       -1        0       -1       -1
      VIF        1        1        1        0       -1
      MAD        1        1        1        1        0
      Gaussian   1        0        1        0        1
      Conf.      0.007    0.257    0.001    0.081    0.001
      JB Stat    11.768   2.583    20.011   4.843    246.610
      Skew       0.292   -0.139    0.091    0.170   -0.518
      Kurt       3.143    2.957    3.764    2.818    5.554
    • 75. Statistical Significance
      • LIVE Database, Gaussianity
      Motivation Methods Results
    • 76. Statistical Significance
      • LIVE Database, Gaussianity
      Motivation Methods Results
    • 77. Statistical Significance
      • LIVE Database, Gaussianity
      J.B. Statistic = 1.5: Gaussian with greater than 95% confidence. Motivation Methods Results
    • 78. Results
      • CSIQ ( Categorical Subjective Image Quality ) Database
      • Table top randomization approach
        • Reference image always available
        • Distorted images viewable at one time
        • Placement denotes linear quality
        • Electronic table (four monitor array)
      Motivation Methods Results
    • 79. Results
      • CSIQ ( Categorical Subjective Image Quality ) Database
      • Table top randomization approach
        • Reference image always available
        • Distorted images viewable at one time
        • Placement denotes quality
        • Electronic table (four monitor array)
      Motivation Methods Results
    • 80. Statistical Significance
      • CSIQ Database, 99% Confidence
      • 1 = better, 0 = same, -1 = worse
      Motivation Methods Results

      ALL       PSNR     SSIM     VSNR     VIF      MAD
      PSNR       0       -1        0       -1       -1
      SSIM       1        0        1        0       -1
      VSNR       0       -1        0       -1       -1
      VIF        1        0        1        0       -1
      MAD        1        1        1        1        0
      Gaussian   0        1        1        0        0
      Conf.     >0.500    0.003    0.001   >0.500    0.425
      JB Stat    0.414   17.619   24.255    0.652    1.560
      Skew      -0.053   -0.303   -0.569    0.084    0.108
      Kurt       3.149    4.026    3.812    3.156    3.281
    • 81. Results
      • CSIQ Overall Performance, no contrast
      Motivation Methods Results

              PSNR     SSIM     VSNR     VIF      MAD
      CC      0.9178   0.9313   0.9499   0.9263   0.9637
      SROCC   0.9185   0.9364   0.9478   0.9294   0.9594
      RMSE    166.55   152.80   131.10   158.11   111.94
      R out   0.344    0.316    0.240    0.312    0.236
      R SOD   447.9    383.3    256.6    387.0    237.1