Evaluating noise filters

Most of the new papers that I come across that propose a new or improved way of filtering out noise from images use the Peak Signal-to-Noise Ratio (PSNR) as a means to evaluate their results. It has been shown again and again that this is not a good way of evaluating the performance of a filter. When people compute the PSNR for a filtered image, what they actually do is compare this filtered image to an undistorted one (i.e. known ground truth). This is very different from what the name PSNR implies: the ratio of peak signal power to noise power. Of course, that is something that cannot be measured: if we’d be able to separate the noise from the signal and measure the power of the two components, then we wouldn’t need to write so many papers about filters that remove noise! So instead, the typical PSNR measure in image analysis uses the difference between the filtered image and the original (supposedly noise-free) image, calls this difference the noise, and computes its power (in dB):

a = readim('cameraman')
b = noise(a,'gaussian',20)
c = bilateralf(b,2,30)
c_psnr = 10*log10(255^2/mean((a-c)^2))
c_psnr =

input imagenoisy imagefiltered image

The problem is that this does not compute the amount of noise in the image, and therefore is misleading. If the filter affects the signal as well as removing noise (which is always the case), both these changes will be taken as noise. What is more, this method assumes that we know what the image looks like without noise. If we use, as I did above, a natural image as a base for testing our filter, then we do not know what that image looks like without noise (all natural images have noise!). Sure, we added noise to the image and watched the filter reduce this noise. But we cannot expect our filter to know the difference between the noise we added to the image, and the noise that was already there. If we had a perfect filter that removed all the noise without affecting the signal at all, its result would be different from the original image (which has some unknown small amount of noise), and therefore we’d be underestimating the filter’s performance!

In my opinion, if what we really want to do is compare the filtered image with the original “noise-free” image, we should simply use the Mean Square Error (MSE). The MSE is the interesting part of the computation for PSNR, doesn’t make assumptions about the maximum signal power, and doesn’t claim to compute something we’re not computing:

c_mse = mean((a-c)^2)
c_mse =

If you compare this value to the MSE of the noisy image b (404.0), you get a very good idea of how much the filter undoes the noise we added to the image. But of course we do also need to make sure the input image really is noise-free for this to be correct, and the only way of doing so is by using a synthetic image.

To really characterise the performance of a noise filter, one should not simplify the comparison to a single number: part of the difference between the images is noise that wasn’t removed, part of the difference is distortion added to the image. And these two effects are not equal. For example, the two images below have the same MSE (and, consequently, the same PSNR):

b = gaussf(a,3)
b_mse = mse(b,a)
c = noise(a,'gaussian',sqrt(b_mse))
c_mse = mse(c,a)
b_mse =

c_mse =

image bimage c

I don’t believe these two images should get the same “score,” the noisy one looks better to me!

The main complaint against the MSE and similar measures for error is that they’re too “simplistic.” For example, adding an offset to the image (which doesn’t affect its signal-to-noise ratio at all), has a huge effect on the MSE:

d_mse = mse(a+sqrt(b_mse),a)
d_mse =

So how can we more properly study the effect of a noise-reduction filter? One way is to look at what the filter removed. How much signal is in there? Ideally, there would be none. All the pixels should be independent of its neighbours. If so, all that the filter removed was noise, without affecting the signal at all. Of course, this is not possible, so typically you’d see some part of the signal in there. This signal has some autocorrelation (i.e. the pixel values are not independent of each other). We can look for autocorrelation in the difference between the input and output of the filter:

N1 = autocorrelation(a-b);
N2 = autocorrelation(a-c);

autocorrelation of bautocorrelation of c

I’m showing only the central part of the autocorrelation, as there typically is no correlation between points further away. Clearly, image b has a disturbed signal, therefore a-b has strong autocorrelation. For image c this is not the case. One simple way to quantify the autocorrelation is to take the root mean square value of the autocorrelation at short distances (excluding the distance 0, of course):

m = rr(a)<2 & rr(a)>0;
b_ac = sqrt(sum(N1(m)^2))
c_ac = sqrt(sum(N2(m)^2))
b_ac =

c_ac =

As we can see, the one has a much larger value than the other, as expected. Ideally, we would normalize the input to the autocorrelation function, because stronger noise will have a larger value in this measure. The autocorrelation measure, in combination with the standard deviation of the difference (which is equivalent to the MSE), will quantify both the amount of noise and the amount of signal removed from the image. I’d rather give both figures separately, as it is application dependent how these two filter effects should be weighted.

A different approach is the Structure SIMilarity index (SSIM), which tries to provide a measure for how perceptually similar two images are. It also tries to summarize differences with a single number, but does so by measuring three different similarity cues, and combining them with empirically determined weights. There’s quite a few resources relating to SSIM on the inventor’s website. On our two images we get:

ans =

ans =

And we see that the noisy image has a lower value, which I don’t agree with (as I mentioned above). And that is exactly the problem with perceptual measures: not everybody agrees on what is better! In any case, there are many, many variants on the SSIM theme, the nice thing about SSIM is that it is so simple.

What is you take on this? Is there a better way of evaluating a noise-reduction filter?

10 Responses to “Evaluating noise filters”

  1. On September 25th, 2012, at 12:43, Matteo said:

    Hi Cris 1

    I really enjoyed reading this post. 

    As a geoscientist I like to try image processing techniques in geophysical applications, and with that in mind, here is my take on evaluating noise filters. 

    One application of image processing  in geophysics is the removal of acquisition footprint from

    reflection seismic data. This footprint noise appears as modulations in the recorded seismic

    amplitudes that are spatially correlated to the surface locations of dynamite sources and geophone receivers used in the survey. There is a good example in here:


    Filtering of acquisition footprint is generally done in the Fourier domain, with rectangular notch filters in the simplest implementations. In the example above the authors used a more sophysticated technique derived directly from image processing.

    The evaluation of filter success is done by looking at the image of filtered noise in the spatial domain. The processing geophysicist or seismic interpreter may simply rely on their experience and try to visually assess how much signal might there be in what was removed. Since what is imaged in this case is the geology in the subsurface, often this is done by simply answering the question: “is there anything that looks geological in there?” But that can be very subjective and I like to quantify performance more than that.   

    With some modeling work I will be publishing soon I tried MSE but concluded, as you do here, that a single number is not a satisfactory  way of quantifying the goodness of the filter result. I have one case where MSE is nearly identical for  three filters but I visually disagree. 

    I prefer using a comparison of canny or sobel edge detection on the noise free model  input versus edge detection on filtered result. I posted an example in my comment to a previous post of yours:


    In the case of seismic acquisition footprint the results are even more striking since the noise is often linear.

    I find the edge detection method is much better. But it is still to an extent subjective. I very much like your idea of using autocorrelation, and I will definitrly incorporate it in my modeling test to further, and more objectively evaluate this type of filtering results. I know some companies use the autocorrelation of the field acquisition as input to the filter design, I will have to look into their workflow as well. I’ll keep you posted when I publish on my blog

  2. On September 25th, 2012, at 13:02, Cris Luengo said:

    Hi Matteo,

    I was meaning to address your comment in this post, but sadly ended up forgetting to do so. One question, though, is how do you quantify the correctness of the edges you find? And can you distil the result to one or a few numbers that are easy to compare?

    The seismic data you work with is interesting in that the noise is correlated with the data (if I understand you correctly). The autocorrelation helps when the noise in each pixel is independent of the noise in other pixels, but might not be as useful in this case.

    Do keep me posted about this!

  3. On September 25th, 2012, at 21:00, Matteo said:

    Hi Cris
    Great discussion! 

    I will answer your questions on edge detection result in a separate comment. 

    Regarding the seismic noise example, good observation. I need to clarify what I was thinking as I leapt ahead and talked of using autocorrelation but without giving sufficient detail. Yes of course the footprint noise is related to the data, specifically to the field acquisition geometry.
    The next two images are low resolution copies – which I consider fair use – of two figures from this paper:
    Soubaras, R. [2002] Attenuation of acquisition footprint for non- orthogonal 3D geometries. 72nd SEG Annual Meeting, Expanded Abstracts, 2142-2145

    The first is an example of acquisition geometry:

    The second is the autocorrelation of the noise where the acquisition pattern is clearly recognizeable:

    Soubaras used the autocorrelation as part of the filter design.

    Although I use a different method to design my filters, reading your post and recalling this paper got me thinking.

    In your example the filter performance is assessed by looking at the autocorrelation of difference  IN – OUT, and choosing the filter that gives the lowest value

    In the seismic case, I think I could still use autocorretation this way:

    Calculate the autocorrelation of data
    Calculate the autocorrelation of difference  IN – OUT
    Pick the filter for which the two are closest (quantified using RMS as you suggest). What do you think?

  4. On September 25th, 2012, at 21:06, Matteo said:


    Edit to my latter post:

    when I write above autocorrelation of data it should read autocorrelation of noise, which can be calculated from the data following Soubaras. Thanks. Matteo

  5. On September 26th, 2012, at 9:35, Cris Luengo said:


    Yes, I think that does make sense. And it sounds very interesting, I’ve never dealt with data of this type. Thanks for the explanation!

  6. On October 5th, 2012, at 21:03, Matteo said:

    Hi Cris

    You asked “… how do you quantify the correctness of the edges you find? And can you distil the result to one or a few numbers that are easy to compare?”

    First the easy answer, to your second question. No, I have not been able (just yet, I hope) to distill the result as you say. Perhaps something like connected component count might work?

    Answering your first question is not straightforward. To do that, I will use an example from my modeling work. I used King Tut’s cat scan from this 2005 press release by the Egyptian Supreme Council of Antiquities:


    Medical images are great for modeling because we know how any part of the human body should look like. Contrary to what happens with natural images, or seismic maps, with medical images even if we did not have the noise-free original, only a very noisy one, we can evaluate the goodness of the filtering result by just looking at it.

    I made a lower resolution copy of the second image in the press release and added some noise, simulating the kind of seismic acquisition footprint discussed in my previous comments. Then I tried to remove the noise with several custom filters. In here I uploaded two of my filtered result, after Sobel edge detection:



    I think it is very easy in this simplified view to establish which filter did a better job, far simpler than with the full filtered images prior to edge detection. For me it is the first result, because the edge at the top of the skull is more continuous.

    With seismic images of geological features things are less straightforward. Certainly experienced interpreters know how a meandering channel belt looks like:


    Sobel in my view would simplify the assessment by reducing the filtered image to the essential, making it easy to compare results of different filters. Say for example one filtered image showed the otherwise continuous channel edges is broken down in a small area by portions of unremoved linear noise. The edge instead appears intact in the other.

    This kind of assessment is far less subjective in my opinion than trying to spot if there is any, even the faintest hint, of something that looks like a meander in the removed noise in either case. But that opinion changed now that I know, from this post, how to make that call using autocorrelation.

    So, thank you!

  7. On October 6th, 2012, at 7:56, Cris Luengo said:

    Hi Matteo,

    Thanks so much for taking the time to write that down. I think this can be a very valuable technique. And I think it leads the discussion to a more general approach where you evaluate a noise reduction filter (or any other data correction technique, for that matter) by the measurement results you get from the filtered image. “Does my measurement become more precise or accurate when I filter the image this way?” You don’t see this type of analysis often enough in the scientific literature.

    Thanks for the interesting discussion!

  8. On January 30th, 2014, at 18:45, Sharu Jaiboy said:


    I am trying to calculate the PSNR of .dcm (uint16) images.I have got two different codes for this.But while calculating with these two codes I’m getting different values.Can youplease tell me which is the correct one..??

    The 1st one is:

    [M N] = size(I1);


    MSE = sum(sum((I1(:)-I2(:)).^2)) / (M * N);



    The other is:






    Thanking you in advance…

  9. On January 31st, 2014, at 23:22, Cris Luengo said:


    The difference seems to be the “peak signal power”, in one it is 2^16, in the other it is 2^32. If your maximum intensity has a value of 255, use the first one; if it is 65535 (16-bit data), use the second one.

    But, as discuss in this post, I see no point in using the PSNR. Why do you think you need it? Why not just report the MSE?

  10. On June 3rd, 2015, at 21:20, Cris Luengo said:

    I just found out about this other quality metric, FSIM. The paper shows that this method has (it seems to me, after very quick browsing) a good agreement with perceptual quality.


Leave a Reply

Your email address will not be published. Required fields are marked *

Note: I moderate all comments. Comments without a clear relation to the text above will not be published.

You can use these HTML tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre lang="" line="" escaped="" cssfile="">