Cris’ Image Analysis Blog

theory, methods, algorithms, applications

Implementing the convolution

I find myself regularly explaining how to efficiently implement a convolution (a linear filter). This blog post will combine all my tips and tricks regarding convolution. I’ll start with the 1D case, then will expand into the multi-dimensional case.

In the 1D case, I’ll use \(t\) as the variable, x(t) is a discrete input signal of finite length (a set of samples) and only defined for integer \(t \in [0, N)\) . We’ll consider \(y = x * h\) , where \(h\) is the kernel and \(y\) is the result of convolving the signal and the kernel. We care here only for the case where \(y\) has the same number of samples as \(x\) .

1D convolution

There are two different ways that a 1D convolution can be defined. There’s the Infinite Impulse Response (IIR) and the Finite Impulse Response (FIR) filters.

An IIR filter uses previous output values to compute the current output value. That is, it is recursive. For example,

\[ y(t) = a x(t) + b y(t-1) + c y(t-2) \quad .\]

To see what the impulse resonse of this filter is, we’d have to apply it to an impulse signal (a Dirac delta). If we were to do this, we’d find that the impulse resonse (or kernel) \(h(t)\) is zero for \(t<0\) , and non-zero for \(t>=0\) . That is, it is causal (it doesn’t depend on any future data), and it is infinite (it never really ends).

An IIR filter can also be anti-causal (does not depend on any past data), such a filter would be computed backwards in time.

A FIR filer uses only a finite set of input values to compute the current output value. For example,

\[ y(t) = a x(t-1) + b x(t) + c x(t+1) \quad .\]

A FIR filter can also be causal or anti-causal, but it can also be neither, if it uses past and future input samples.


Both the FIR and IIR filters can be implemented quite trivially, except for a few details we’ll discuss below. But most output samples we can compute simply by applying the equation that defines the filter. In the FIR case, each output value can be computed independently of others, and so is trivial to parallelize. The FIR filter, having a recursive definition, must be computed left to right (for the causal case) or right to left (for anti-causal case).

In some cases, there are some tricks worth thinking about. For example, if the FIR kernel is symmetric, the computations can be rearranged to reduce cost:

\[ \begin{split} y(t) &= a x(t-1) + b x(t) + a x(t+1) \\ &= a \left\{ x(t-1) + x(t+1) \right\} + b x(t) \quad . \end{split} \]

For longer kernels the savings can be significant.

Boundary condition

For both types of filter, we need to decide what happens when we need to read values outside the domain. In the FIR case, for \(t=0\) we need to fill in a value for \(x(-1)\) , which we do not know (there are many ways to choose these values, what is best depends on the application, we won’t go into that here). In the IIR case, we need to initialize \(y(t)\) for the set of values of \(t<0\) that we will need to read.

But what matters is how we determine whether we can read a given value or not. A conditional expression is very expensive in modern hardware. Code like

if t >= 0 {
   sum += x[t] * h[k];


sum += x[max(t, 0)] * h[k];

will significantly slow down execution. We want to avoid these tests for the bulk of the signal that is sufficiently far from the boundary, where we know for sure we can read all required input values. There are two ways of accomplishing this:

  1. Copy the input signal into a larger buffer (i.e. add padding), and fill in the samples outside the known signal in our desired way. We can now safely read x[-1] for example. In the IIR case this translates to creating a larger output buffer to write in and read from.

  2. Write three separate loops: the first one applies the filter for the first few output values where the boundary plays a role, a second loop processes the bulk of the signal and has no conditional statements, and a third loop produces the last few output values where the boundary again plays a role (in the skeleton code below, N is the number of samples in x and y, K is the number of samples in h, we assume K is odd and the origin of the kernel is in the center):

    int i = 0;
    int margin = K / 2; // floor division!
    for (; i < margin; ++i) {
       // compute convolution applying boundary condition
       // on the left side of the input
    for (; i < N - margin; ++i) {
       // compute convolution without any conditional statements
    for (; i < N; ++i) {
       // compute convolution applying boundary condition
       // on the right side of the input

[I would typically not use int for indexing or loops, but wanted to keep the code above simple.]

Fast Fourier Transform

A FIR filter can be implemented efficiently using the FFT only if the kernel is large. The FFT is relatively expensive, and not worth using for smaller kernels.

To use this method, pad the kernel h with zeros to the same size as the input x, apply the FFT to both, multiply them (remember the output of the FFT is complex, we must properly multiply the complex numbers), then compute the inverse transform of the result. The inverse transform should be real-valued, but the imaginary component will likely have some very small values in it due to floating-point rounding errors. Simply ignore the imaginary component, do not take the magnitude (absolute value) of the complex numbers.

When using the FFT, remember that the FFT takes the first sample as the origin. For the signal it doesn’t matter where the origin is, but for the kernel this is indeed very important. If we define the kernel with the origin in the middle, a careless transformation of the kernel will result in the output signal being shifted.

Assuming again a kernel h with K samples and the origin in the middle (at K/2, using floor division), we must pad the kernel as follows:

padded_h = cmalloc(N * sizeof(*padded_h)); // initialized to 0
memcopy(padded_h, h + K / 2, K / 2 + 1);
memcopy(padded_h + N - K / 2, h, K / 2);

The convolution is now applied with (pseudocode):

X = fft(x, N);
H = fft(padded_h, N);
Y = malloc(N * sizeof(*Y));
for (int i = 0; i < N; ++i) {
   Y[i] = X[i] * H[i];
y = ifft(Y, N); // if y is complex, keep the real part

But note that the boundary condition imposed by the FFT is periodic: \(x(N) = x(0)\) . This is usually not a desireable boundary condition. So we must pad the input before applying the FFT. We can pad x on both sides by K/2, using our desired method (h must be padded to this new length!), and crop the final result back to the original size. Since we’re padding anyway, we can choose a slightly larger size that will be cheaper to compute the FFT for. The FFT is most efficient when the length N can be decomposed into a product of small integers (typically 2, 3 and 5, maybe also 7 depending on the implementation we’re using).

nD convolution

The nD case is more or less the same as the 1D case. I don’t know of any nD IIR filters, so I’ll discuss FIR filters only. Applying the convolution requires looping over multiple dimensions, and we need to take care of boundary conditions in all dimensions. Again, the simple way is to copy the image into a larger buffer, adding pixels outside the original boundary. The more complex way would require 3n loops (for n dimensions): each of the three loops discussed in the 1D case would have three loops inside it for the second dimension, etc. Each inner loop handles a corner, and edge or the central part of the image. For the 2D case this might still be doable, but it becomes increasingly difficult for higher dimensions. The copy required to pad the image doesn’t sound so bad after all…

But there is an additional trick that comes into play in the nD case: some kernels can be decomposed into a set of 1D kernels.

Kernel decomposition

We know that the convolution is associative. That is, (x * h_1) * h_2) = x * (h_1 * h_2). This can be used in two ways:

  1. When applying multiple convolutions in sequence, we can combine the kernels together, and apply them all as a single convolution. This saves in memory accesses, since we’re looping over the image only once. In 1D, if one kernel has size K and the other has size M, then the combined kernel has size K + M - 1, so there is not really a computational advantage other than the more efficient memory access. When using the FFT, the core advantage is the fewer FFTs one needs to compute.

  2. If we can decompose a kernel into a set of 1D kernels, we can significantly reduce the computational cost. For example, a kernel of \(K \times K\) pixels takes \(K^2\) multiplications and additions to compute, but if we can decompose it into two 1D kernels of length \(K\) , then the cost is reduced to \(2K\) multiplications and additions. The larger \(K\) , the larger the savings. Kernels that can be decomposed like this are the square kernel with uniform weights, the Gaussian kernel and all its derivaties, the Sobel kernels, etc. These kernels are called separable.

However, applying two 1D kernels in succession does require iterating over the image twice, which adds to the cost. So very small kernels like Sobel are not worth decomposing.

The Gaussian filter

But the Gaussian kernel certainly is worth decomposing. Decomposing the Gaussian kernel makes it trivial to implement a Gaussian filter for an image with an arbitrary number of dimensions. We only need to implement the 1D Gaussian, then just loop over the lines along each of the image dimensions. The 1D Gaussian can be implemented as a FIR filter, as an IIR filter, and through the FFT. Which one to pick depends on the sigma: The FIR filter increases in cost with the sigma. The IIR filter is quite expensive, but independent of sigma. And so for sigma above some value, the IIR filter is the most efficient method. The FFT method is always the most expensive, but is useful when the sigma is so small that it cannot be implemented correctly using a FIR filter (the lower limit for sampling a Gaussian kernel is a sigma of about 0.8 pixels).

In closing

To know which implementation is faster, which tricks to apply, you must always measure. There are rules of thumb you can use, but each machine is different, and each situation is unique. Try different things and time them all!

This is a graph comparing the speed of the three methods to compute the Gaussian filter, as implemented in DIPlib, on my machine. We apply the filter to image of 2000x2000 pixels, for different sigma:

Graph comparing speed of different implementations of the Gaussian filter

On my machine, for a sigma of 10 pixels, the FIR and IIR filters are equally fast. It is never faster to use the FFT to compute a Gaussian filter. Would these conclusions be different on your machine?

Questions or comments on this topic?
Join the discussion on LinkedIn or Mastodon