Efficient algorithms vs efficient code

Having spent quite a bit of time porting 25-year-old code, I have been confronted with the significant changes in CPU architecture over that time. The code in DIPlib was very efficient back then, but some of its optimizations did not age well at all. I want to show some examples here. It is nice to see that a little bit of effort put into optimization can go a long way.

I also want to give a quick example of a highly optimized implementation of an inefficient algorithm, which highlights the difference between algorithms and code.

Optimal cache usage

Everybody looking to process larger amounts of data has run into cache miss problems. For any sort of algorithm applied to a data set, the execution time as a function of data set size always has two distinct slopes, with a break when the data set size first exceeds the cache size. Execution speed drops considerably at that point. For such large data sets, it is very important to pick the algorithm that best uses the cache, which always means reading data in storage order, or at least accessing contiguous portions of the data most frequently.
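
As an illustration, here is a minimal sketch (in C++, not DIPlib code; the function names are mine) of two loops that compute the same sum over an image stored in row-major order. The first reads the data in storage order, the second against it; on images larger than the cache, the second version is typically several times slower:

   #include <cstddef>

   // Minimal sketch: summing the pixel values of an image stored in
   // row-major order. Both functions compute the same sum, but the first
   // reads memory contiguously, while the second jumps by a full row at
   // every access and thus wastes most of each cache line it loads.
   double sumInStorageOrder( float const* img, std::size_t height, std::size_t width ) {
      double sum = 0.0;
      for( std::size_t y = 0; y < height; y++ ) {   // rows in the outer loop
         for( std::size_t x = 0; x < width; x++ ) {
            sum += img[ y * width + x ];             // contiguous reads
         }
      }
      return sum;
   }

   double sumAgainstStorageOrder( float const* img, std::size_t height, std::size_t width ) {
      double sum = 0.0;
      for( std::size_t x = 0; x < width; x++ ) {    // columns in the outer loop
         for( std::size_t y = 0; y < height; y++ ) {
            sum += img[ y * width + x ];             // strided reads
         }
      }
      return sum;
   }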

The old DIPlib (the versions up to and including the current official release) uses a method of reading and writing pixel data that made a significant speed difference on the architectures that existed 25 years ago. My colleague Geert van Kempen even wrote a paper about this method. Very simply put, if one processes an image by rows first, and then by columns (as one does with any separable filter), this method writes the rows as columns during the first step. The second step then again reads rows and writes columns. In effect, the image is transposed in each step. The advantage stems from the fact that the filter then always reads image rows (which are stored contiguously), and thus uses the cache optimally. This method is no longer used in DIPlib 3.0, as any advantage in cache usage has become insignificant due to changes in cache management. For even though the method uses the cache optimally for reading, it uses it in the worst possible way for writing, which apparently is not something modern processors are optimized for.

This is to say: explicit optimizations for one specific hardware system can slow down execution on different hardware. I usually prefer keeping code simple and straightforward. This has several advantages: shorter code leads to fewer cache misses when loading instructions; simple code is easier for the compiler to optimize; and, most importantly, it’s easier to read and maintain.

Conditional jumps

Another obvious difference between the hardware of 25 years ago and that of today is pipelining. Back in the day, an “if” statement (which leads to a conditional jump in machine code) took one clock cycle on most hardware, whereas a floating-point multiplication took several clock cycles. An “if” statement was relatively cheap. Today, all processors use what is called a pipeline: an instruction that takes 8 clock cycles can be started every clock cycle, filling up the pipeline. If the code is written such that the result of one operation is independent of the result of the previous operation, then this second operation can be started on the same pipeline before the previous operation has finished. In this way, 1000 8-clock-cycle operations can be computed in 1007 clock cycles. However, a conditional jump throws a wrench into this: if the jump depends on the result of a computation, the processor has to wait for that computation to finish to know where to jump to (or rather, whether to jump or not). Modern processors use branch prediction to determine the most likely outcome, and start computing towards that outcome. But if the prediction was wrong, things have to be rewound, and one loses a lot of time. Thus, on modern processors, one tries to avoid “if” statements within loops.

The old DIPlib actually has quite a bit of code like this:

   for( ii = 0; ii < N; ii++ ) {
      out[ ii ] = option ? <do something> : <do something else>;
   }

By changing such code to be a little bit more verbose, one can gain a significant speedup:

   if( option ) {
      for( ii = 0; ii < N; ii++ ) {
         out[ ii ] = <do something>;
      }
   } else {
      for( ii = 0; ii < N; ii++ ) {
         out[ ii ] = <do something else>;
      }
   }

That is, the conditional jump is outside the loop. Even though option is a constant, and branch prediction does a very good job in this case, moving the condition outside the loop is significantly faster.

Some timing differences

The following graph shows timing differences between several neighborhood filters in the old and new incarnations of DIPlib. The algorithms are identical, but code has been tweaked and simplified.

image: Timing for several neighborhood filters

One can see that for some filters the slope has become less steep (that is good!), whereas for others it has become steeper (bad); do note that this is a log-log plot. An obvious improvement in the Kuwahara filter implementation shows here, but for other filters (uniform and varif) I do not understand the difference; it is possibly caused by differing compiler settings. The graph also shows a smaller offset for most filters, which reflects an improvement in the loop over the image. Not shown here: these filters all run in parallel now (in the old DIPlib, this particular class of filters was not yet parallelized).

In a similar vein, the FIR implementation of the Gaussian filter (which is quite important in DIPlib) shows a very large improvement for small filters (i.e. looping over the image is much faster), but also a much steeper slope (i.e. the inner loop, the computation that is done at each pixel, is slower). I have poked at this code for hours, but I cannot see why this inner loop should be slower than in the old code. Note that this plot is log-linear, which is why the lines are curved; the complexity of the algorithm is linear in sigma.

image: Timing for several neighborhood filters

In the graph above, the difference between the blue and red lines indicates the speedup obtained through multi-threading. I have four cores on this machine. The following graph shows these speedups for both versions of the library. The new infrastructure was set up from the start for multi-threaded computation, whereas in the old library OpenMP was retrofitted. This shows in the better speedup:

image: Timing for several neighborhood filters

Optimizing the wrong algorithm

I am much more interested in good algorithms than in highly optimized code. A good algorithm reduces the complexity of the operation, leading to very large speed improvements. No matter how well you can optimize code, if you start with the wrong algorithm, your performance will suffer. I did a quick comparison of DIPlib implementations of the dilation (different structuring element shapes use different algorithms) to the implementations in the MATLAB Image Processing Toolbox. The IPT has highly optimized code, no doubt using single-instruction multiple-data (SIMD) features of modern processors and so on. However, when one looks at the timings, it is obvious that they optimized the wrong algorithms:

image: Timing for several neighborhood filters

The square structuring element is decomposed into two 1D operations (it’s separable, much like the Gaussian filter). Each of these 1D operations can be computed in O(1) time per pixel, that is, independently of the length of the SE. Both the old and the new DIPlib use this algorithm, and consequently show a fairly flat curve. The new implementation does away with the boundary extension (which does depend on the length of the SE), and so its curve is even flatter. The IPT implementation, however, is not at all flat, and one wonders what algorithm they used (it doesn’t grow linearly with the SE size, so it’s not the naïve algorithm). It is also strange that multi-threading (the dotted lines) doesn’t seem to affect its timing much.
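
For reference, the standard way to get a 1D dilation whose cost per pixel does not depend on the SE length is the van Herk/Gil-Werman algorithm, which needs roughly three comparisons per pixel. The sketch below is my own simplified version of that algorithm (not DIPlib’s actual code); it computes only the part of the output that does not require boundary extension:

   #include <algorithm>
   #include <cstddef>
   #include <vector>

   // Sketch of the van Herk/Gil-Werman max filter: a 1D dilation with a flat
   // line SE of length k, using roughly 3 comparisons per pixel, independent
   // of k. Computes only the "valid" outputs,
   //    out[ i ] = max( in[ i ], ..., in[ i + k - 1 ] ),
   // and assumes k >= 1 and in.size() >= k.
   std::vector< float > dilateLine( std::vector< float > const& in, std::size_t k ) {
      std::size_t n = in.size();
      std::vector< float > g( n );   // running max within each block of k pixels, scanning forward
      std::vector< float > h( n );   // running max within each block, scanning backward
      for( std::size_t i = 0; i < n; i++ ) {
         g[ i ] = ( i % k == 0 ) ? in[ i ] : std::max( g[ i - 1 ], in[ i ] );
      }
      for( std::size_t i = n; i-- > 0; ) {
         bool blockEnd = ( i == n - 1 ) || (( i + 1 ) % k == 0 );
         h[ i ] = blockEnd ? in[ i ] : std::max( h[ i + 1 ], in[ i ] );
      }
      // Each window [ i, i + k - 1 ] spans at most two blocks: h[ i ] covers
      // the part in the first block, g[ i + k - 1 ] the part in the second.
      std::vector< float > out( n - k + 1 );
      for( std::size_t i = 0; i + k <= n; i++ ) {
         out[ i ] = std::max( h[ i ], g[ i + k - 1 ] );
      }
      return out;
   }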

The diamond-shaped structuring element can also be decomposed into two 1D operations, but a third dilation, with a unit-sized diamond, is required. The old DIPlib did not implement it this way, instead using the same algorithm it uses for general SE shapes. The IPT also decomposes it into lines, but their implementation of the line SE is wonky (as we saw above). (Also, the new DIPlib implementation is wonky for size 7, but that’s a story for another day.)

image: Timing for several neighborhood filters

Finally, the most expensive operation is for shapes that cannot easily be decomposed into simpler operations. DIPlib uses an algorithm that decomposes these shapes into many horizontal lines. These lines are not processed using the O(1) 1D operator, because that would require too many intermediate images (one for each line length). But the algorithm is still better than the naïve implementation, which is O(k) per pixel, with k the number of pixels in the SE. It is clear from the graph below that the naïve algorithm is what the IPT implements:

image: Timing for several neighborhood filters
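
For comparison, a naïve dilation looks something like the sketch below (my own code, not the IPT’s): for every output pixel it visits every pixel of the SE, so the work per pixel grows linearly with the SE size.

   #include <algorithm>
   #include <cstddef>
   #include <limits>
   #include <utility>
   #include <vector>

   // Sketch of a naive gray-value dilation with an arbitrary flat SE, given
   // as a list of ( dx, dy ) offsets: O(k) work per pixel, with k the number
   // of pixels in the SE. Offsets falling outside the image are skipped,
   // which amounts to a very crude form of boundary handling.
   void dilateNaive( float const* in, float* out,
                     std::ptrdiff_t width, std::ptrdiff_t height,
                     std::vector< std::pair< std::ptrdiff_t, std::ptrdiff_t >> const& se ) {
      for( std::ptrdiff_t y = 0; y < height; y++ ) {
         for( std::ptrdiff_t x = 0; x < width; x++ ) {
            float m = std::numeric_limits< float >::lowest();
            for( auto const& offset : se ) {          // the O(k) inner loop
               std::ptrdiff_t xx = x + offset.first;
               std::ptrdiff_t yy = y + offset.second;
               if(( xx >= 0 ) && ( xx < width ) && ( yy >= 0 ) && ( yy < height )) {
                  m = std::max( m, in[ yy * width + xx ] );
               }
            }
            out[ y * width + x ] = m;
         }
      }
   }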

In closing

The DIPlib 3.0 code is available on GitHub. Poke around, let me know what you think, contribute!

5 Responses to “Efficient algorithms vs efficient code”

  1. On October 22nd, 2017, at 0:41, Royi said:

    Hi,

    Could you elaborate on the new implementation of the Gaussian Blur filter (Or basically the Separable Filter)?

    It would be nice reading and learning about it.

  2. On October 22nd, 2017, at 15:30, Cris said:

    Royi,

    Really quickly: the separable convolution in DIPlib is based on the separable framework (documentation, code). The framework calls a line function for each line in the image. It copies the line to a temporary buffer, adding pixels on either side (boundary extension) so that the line function doesn’t need to do anything special for the pixels close to the image edge. The output of the line function is then written directly to the output image. The line function for the separable convolution is pretty straightforward (see the code, lines 133-225). The Gaussian smoothing filter is an even (symmetric) filter (the 2nd case), so we can make use of that evenness to save on some multiplications.
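
    To illustrate the evenness trick with a minimal sketch (not the actual DIPlib line function; names and signature are mine): for a symmetric kernel we only store the half-kernel weights, and add the two mirrored input samples before multiplying, roughly halving the number of multiplications.

       #include <cstddef>

       // Sketch of a 1D convolution with an even (symmetric) kernel, such as
       // a Gaussian. The input line is assumed to be boundary-extended by r
       // pixels on each side (as the separable framework does), so in[ -r ]
       // through in[ n - 1 + r ] are valid. w[ 0 ] is the central weight,
       // w[ 1 ] .. w[ r ] the right half of the kernel.
       void convolveSymmetricLine( float const* in, float* out, std::ptrdiff_t n,
                                   float const* w, std::ptrdiff_t r ) {
          for( std::ptrdiff_t i = 0; i < n; i++ ) {
             float sum = w[ 0 ] * in[ i ];
             for( std::ptrdiff_t j = 1; j <= r; j++ ) {
                sum += w[ j ] * ( in[ i - j ] + in[ i + j ] );   // one multiplication for two samples
             }
             out[ i ] = sum;
          }
       }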

    The framework is all new code, but it follows the concepts and ideas of the original DIPlib code. The line filter is slightly modified from the original code; the biggest difference is that the switch statement used to be within the loop over the pixels in the line, whereas now there is one loop within each switch case.

    Maybe I’ll write a more detailed description some day.

  3. On October 23rd, 2017, at 8:31, Royi said:

    Hi,
    It will be really nice to have a post on the Gaussian Blur.

    I wrote a function for Separable Convolution (See https://codereview.stackexchange.com/questions/172157).
    From my experience, you are better off not copying to a temporary array, but dealing with the boundaries specifically.

    I used the classic method of reading rows and writing columns.
    I wonder if, according to what you say, on modern CPUs one should do it differently.

    Thank You.

  4. On October 23rd, 2017, at 21:04, Cris said:

    Royi,

    When computing the convolution, you read data from the same line many times (especially for larger kernels). If these data are contiguous in memory, this process uses the cache more efficiently. Thus, it is advantageous to do one of two things: read rows and write columns, or copy the line into a temporary buffer.

    I think the timing differences I observed with reading rows and writing columns are because you can’t do the operation in place: you cannot process one line of the image and write it back into the same memory, you need a new memory block. For a 2D image, you need both a temporary intermediate image and an output image. For higher dimensionalities, you need two temporary intermediate images. If you don’t do the transposition, you can read and write into the same image. And writing into the same image means less pressure on the cache, especially for larger images.

    And sure, not copying lines saves some time, but if you otherwise need to implement all the different possible boundary extensions in every filter you write, it is worthwhile to pay the small extra cost of copying the data and keep each of the filter functions as simple as possible. Plus, to work in place you need to copy the line anyway.
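
    As a concrete illustration (again just a sketch, not the framework’s actual code), copying a line into a buffer with, say, mirrored boundary extension takes only a handful of lines, and every filter that then works on the buffer can ignore the image edge entirely:

       #include <cstddef>
       #include <vector>

       // Sketch: copy one image line of length n into a buffer extended by
       // `border` pixels on each side, using mirrored boundary extension.
       // Element border + i of the buffer corresponds to line[ i ].
       // Assumes border <= n.
       std::vector< float > extendLine( float const* line, std::ptrdiff_t n, std::ptrdiff_t border ) {
          std::vector< float > buffer( static_cast< std::size_t >( n + 2 * border ));
          for( std::ptrdiff_t i = -border; i < n + border; i++ ) {
             std::ptrdiff_t j = i;
             if( j < 0 ) {
                j = -j - 1;           // mirror at the left edge: -1 -> 0, -2 -> 1, ...
             } else if( j >= n ) {
                j = 2 * n - j - 1;    // mirror at the right edge: n -> n-1, n+1 -> n-2, ...
             }
             buffer[ static_cast< std::size_t >( i + border ) ] = line[ j ];
          }
          return buffer;
       }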

    Did you measure a large time difference when copying lines of data to a temporary buffer?

  5. On October 24th, 2017, at 11:25, Cris said:

    I was just pointed to this particular commit to the GNU C library. Someone deleted the hand-written assembly for the logf function, reinstating the plain old C source code for it. On some architectures, latency was improved by 80% and throughput by 240%! This is a good example of outdated optimizations: the assembly might have been efficient 20 years ago, but it’s not optimal for modern platforms. Let the compiler generate the assembly, so that when architectures change, you don’t need to rewrite your software.

    Also, it’s clear that compilers are much better at writing assembly than we are.
