Efficient algorithms vs efficient code

Having spent quite a bit of time porting 25-year-old code, I have been confronted with the significant changes in CPU architecture over that time. The code in DIPlib was very efficient back then, but some of its optimizations did not age well at all. I want to show some examples here. It is nice to see that a little bit of effort put into optimization can go a long way.

I also want to give a quick example of a highly optimized implementation of an inefficient algorithm, which highlights the difference between algorithms and code.

Optimal cache usage

Everybody looking to process larger amounts of data has run into cache miss problems. For any sort of algorithm applied to a data set, the execution time as a function of data set size always has two distinct slopes, with a break when the data set size first exceeds the cache size. Execution speed drops considerably at that point. For such large data sets, it is very important to pick the algorithm that best uses the cache, which always means reading data in storage order, or at least accessing contiguous portions of the data most frequently.
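
As an illustration, here is a minimal sketch (in C++, not DIPlib code; the function names are mine) of two loops that compute the same sum over an image stored in row-major order. The first reads the data in storage order, the second against it; on images larger than the cache, the second version is typically several times slower:

   #include <cstddef>

   // Minimal sketch: summing the pixel values of an image stored in
   // row-major order. Both functions compute the same sum, but the first
   // reads memory contiguously, while the second jumps by a full row at
   // every access and thus wastes most of each cache line it loads.
   double sumInStorageOrder( float const* img, std::size_t height, std::size_t width ) {
      double sum = 0.0;
      for( std::size_t y = 0; y < height; y++ ) {   // rows in the outer loop
         for( std::size_t x = 0; x < width; x++ ) {
            sum += img[ y * width + x ];             // contiguous reads
         }
      }
      return sum;
   }

   double sumAgainstStorageOrder( float const* img, std::size_t height, std::size_t width ) {
      double sum = 0.0;
      for( std::size_t x = 0; x < width; x++ ) {    // columns in the outer loop
         for( std::size_t y = 0; y < height; y++ ) {
            sum += img[ y * width + x ];             // strided reads
         }
      }
      return sum;
   }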

The old DIPlib (the versions up to and including the current official release) uses a method of reading and writing pixel data that made a significant speed difference on the architectures that existed 25 years ago. My colleague Geert van Kempen even wrote a paper about this method. Very simply put, if one processes an image by rows first, and then by columns (as one does with any separable filter), this method writes the rows as columns during the first step. The second step then again reads rows and writes columns. In effect, the image is transposed in each step. The advantage stems from the fact that the filter then always reads image rows (which are stored contiguously), and thus uses the cache optimally. This method is no longer used in DIPlib 3.0, as any advantage in cache usage has become insignificant due to changes in cache management. For even though the method uses the cache optimally for reading, it uses it in the worst possible way for writing, which apparently is not something modern processors are optimized for.

This is to say: explicit optimizations for one specific hardware system can slow down execution on different hardware. I usually prefer keeping code simple and straightforward. This has several advantages: shorter code leads to fewer cache misses when loading instructions; simple code is easier for the compiler to optimize; and, most importantly, it’s easier to read and maintain.

Conditional jumps

Another obvious difference between the hardware of 25 years ago and that of today is pipelining. Back in the day, an “if” statement (which leads to a conditional jump in machine code) took one clock cycle on most hardware, whereas a floating-point multiplication took several clock cycles. An “if” statement was relatively cheap. Today, all processors use what is called a pipeline: an instruction that takes 8 clock cycles can be started every clock cycle, filling up the pipeline. If the code is written such that the result of one operation is independent of the result of the previous operation, then this second operation can be started on the same pipeline before the previous operation has finished. In this way, 1000 8-clock-cycle operations can be computed in 1007 clock cycles. However, a conditional jump throws a wrench into this: if the jump depends on the result of a computation, the processor has to wait for that computation to finish to know where to jump to (or rather, whether to jump or not). Modern processors use branch prediction to determine the most likely outcome, and start computing towards that outcome. But if the prediction was wrong, things have to be rewound, and one loses a lot of time. Thus, on modern processors, one tries to avoid “if” statements within loops.

The old DIPlib actually has quite a bit of code like this:

   for( ii = 0; ii < N; ii++ ) {
      out[ ii ] = option ? <do something> : <do something else>;
   }

By changing such code to be a little bit more verbose, one can gain a significant speedup:

   if( option ) {
      for( ii = 0; ii < N; ii++ ) {
         out[ ii ] = <do something>;
      }
   } else {
      for( ii = 0; ii < N; ii++ ) {
         out[ ii ] = <do something else>;
      }
   }

That is, the conditional jump is outside the loop. Even though option is a constant, and branch prediction does a very good job in this case, moving the condition outside the loop is significantly faster.

Some timing differences

The following graph shows timing differences between several neighborhood filters in the old and new incarnations of DIPlib. The algorithms are identical, but code has been tweaked and simplified.

image: Timing for several neighborhood filters

One can see that for some filters the slope has become less steep (that is good!), whereas for others it has become steeper (bad); do note that this is a log-log plot. An obvious improvement in the Kuwahara filter implementation shows here, but for other filters (uniform and varif) I do not understand the difference; it is possibly caused by differing compiler settings. The graph also shows a smaller offset for most filters, which reflects an improvement in the loop over the image. Not shown here: these filters all run in parallel now (in the old DIPlib, this particular class of filters was not yet parallelized).

In a similar vein, the FIR implementation of the Gaussian filter (which is quite important in DIPlib) shows a very large improvement for small filters (i.e. looping over the image is much faster), but also a much steeper slope (i.e. the inner loop, the computation that is done at each pixel, is slower). I have poked at this code for hours, but I cannot see why this inner loop should be slower than in the old code. Note that this plot is log-linear, which is why the lines are curved; the complexity of the algorithm is linear in sigma.

image: Timing for several neighborhood filters

In the graph above, the difference between the blue and red lines indicates the speedup obtained through multi-threading. I have four cores on this machine. The following graph shows these speedups for both versions of the library. The new infrastructure was set up from the start for multi-threaded computation, whereas in the old library OpenMP was retrofitted. This shows in the better speedup:

image: Timing for several neighborhood filters

Optimizing the wrong algorithm

I am much more interested in good algorithms than in highly optimized code. A good algorithm reduces the complexity of the operation, leading to very large speed improvements. No matter how well you can optimize code, if you start with the wrong algorithm, your performance will suffer. I did a quick comparison of DIPlib implementations of the dilation (different structuring element shapes use different algorithms) to the implementations in the MATLAB Image Processing Toolbox. The IPT has highly optimized code, no doubt using single-instruction multiple-data (SIMD) features of modern processors and so on. However, when one looks at the timings, it is obvious that they optimized the wrong algorithms:

image: Timing for several neighborhood filters

The square structuring element is decomposed into two 1D operations (it’s separable, much like the Gaussian filter). Each of these 1D operations can be computed in O(1) time per pixel, that is, independently of the length of the SE. Both the old and the new DIPlib use this algorithm, and consequently show a fairly flat curve. The new implementation does away with the boundary extension (which does depend on the length of the SE), and so its curve is even flatter. The IPT implementation, however, is not at all flat, and one wonders what algorithm they used (it doesn’t grow linearly with the SE size, so it’s not the naïve algorithm). It is also strange that multi-threading (the dotted lines) doesn’t seem to affect its timing much.
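
For reference, the standard way to get a 1D dilation whose cost per pixel does not depend on the SE length is the van Herk/Gil-Werman algorithm, which needs roughly three comparisons per pixel. The sketch below is my own simplified version of that algorithm (not DIPlib’s actual code); it computes only the part of the output that does not require boundary extension:

   #include <algorithm>
   #include <cstddef>
   #include <vector>

   // Sketch of the van Herk/Gil-Werman max filter: a 1D dilation with a flat
   // line SE of length k, using roughly 3 comparisons per pixel, independent
   // of k. Computes only the "valid" outputs,
   //    out[ i ] = max( in[ i ], ..., in[ i + k - 1 ] ),
   // and assumes k >= 1 and in.size() >= k.
   std::vector< float > dilateLine( std::vector< float > const& in, std::size_t k ) {
      std::size_t n = in.size();
      std::vector< float > g( n );   // running max within each block of k pixels, scanning forward
      std::vector< float > h( n );   // running max within each block, scanning backward
      for( std::size_t i = 0; i < n; i++ ) {
         g[ i ] = ( i % k == 0 ) ? in[ i ] : std::max( g[ i - 1 ], in[ i ] );
      }
      for( std::size_t i = n; i-- > 0; ) {
         bool blockEnd = ( i == n - 1 ) || (( i + 1 ) % k == 0 );
         h[ i ] = blockEnd ? in[ i ] : std::max( h[ i + 1 ], in[ i ] );
      }
      // Each window [ i, i + k - 1 ] spans at most two blocks: h[ i ] covers
      // the part in the first block, g[ i + k - 1 ] the part in the second.
      std::vector< float > out( n - k + 1 );
      for( std::size_t i = 0; i + k <= n; i++ ) {
         out[ i ] = std::max( h[ i ], g[ i + k - 1 ] );
      }
      return out;
   }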

The diamond-shaped structuring element can also be decomposed into two 1D operations, but a third dilation, with a unit-sized diamond, is required. The old DIPlib did not implement it this way, instead using the same algorithm it uses for general SE shapes. The IPT also decomposes it into lines, but their implementation of the line SE is wonky (as we saw above). (Also, the new DIPlib implementation is wonky for size 7, but that’s a story for another day.)

image: Timing for several neighborhood filters

Finally, the most expensive operation is for shapes that cannot easily be decomposed into simpler operations. DIPlib uses an algorithm that decomposes these shapes into many horizontal lines. These lines are not processed using the O(1) 1D operator, because that would require too many intermediate images (one for each line length). But the algorithm is still better than the naïve implementation, which is O(k) per pixel, with k the number of pixels in the SE. It is clear from the graph below that the naïve algorithm is what the IPT implements:

image: Timing for several neighborhood filters
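
For comparison, a naïve dilation looks something like the sketch below (my own code, not the IPT’s): for every output pixel it visits every pixel of the SE, so the work per pixel grows linearly with the SE size.

   #include <algorithm>
   #include <cstddef>
   #include <limits>
   #include <utility>
   #include <vector>

   // Sketch of a naive gray-value dilation with an arbitrary flat SE, given
   // as a list of ( dx, dy ) offsets: O(k) work per pixel, with k the number
   // of pixels in the SE. Offsets falling outside the image are skipped,
   // which amounts to a very crude form of boundary handling.
   void dilateNaive( float const* in, float* out,
                     std::ptrdiff_t width, std::ptrdiff_t height,
                     std::vector< std::pair< std::ptrdiff_t, std::ptrdiff_t >> const& se ) {
      for( std::ptrdiff_t y = 0; y < height; y++ ) {
         for( std::ptrdiff_t x = 0; x < width; x++ ) {
            float m = std::numeric_limits< float >::lowest();
            for( auto const& offset : se ) {          // the O(k) inner loop
               std::ptrdiff_t xx = x + offset.first;
               std::ptrdiff_t yy = y + offset.second;
               if(( xx >= 0 ) && ( xx < width ) && ( yy >= 0 ) && ( yy < height )) {
                  m = std::max( m, in[ yy * width + xx ] );
               }
            }
            out[ y * width + x ] = m;
         }
      }
   }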

In closing

The DIPlib 3.0 code is available on GitHub. Poke around, let me know what you think, contribute!

5 Responses to “Efficient algorithms vs efficient code”

  1. On October 22nd, 2017, at 0:41, Royi said:

    Hi,

    Could you elaborate on the new implementation of the Gaussian Blur filter (Or basically the Separable Filter)?

    It would be nice reading and learning about it.

  2. On October 22nd, 2017, at 15:30, Cris said:

    Royi,

    Really quickly: the separable convolution in DIPlib is based on the separable framework (documentation, code). The framework calls a line function for each line in the image. It copies the line to a temporary buffer, adding pixels on either side (boundary extension) so that the line function doesn’t need to do anything special for the pixels close to the image edge. The output of the line function is then written directly to the output image. The line function for the separable convolution is pretty straightforward (see the code, lines 133-225). The Gaussian smoothing filter is an even (symmetric) filter (the 2nd case), so we can make use of that evenness to save on some multiplications.
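
    To illustrate the evenness trick with a minimal sketch (not the actual DIPlib line function; names and signature are mine): for a symmetric kernel we only store the half-kernel weights, and add the two mirrored input samples before multiplying, roughly halving the number of multiplications.

       #include <cstddef>

       // Sketch of a 1D convolution with an even (symmetric) kernel, such as
       // a Gaussian. The input line is assumed to be boundary-extended by r
       // pixels on each side (as the separable framework does), so in[ -r ]
       // through in[ n - 1 + r ] are valid. w[ 0 ] is the central weight,
       // w[ 1 ] .. w[ r ] the right half of the kernel.
       void convolveSymmetricLine( float const* in, float* out, std::ptrdiff_t n,
                                   float const* w, std::ptrdiff_t r ) {
          for( std::ptrdiff_t i = 0; i < n; i++ ) {
             float sum = w[ 0 ] * in[ i ];
             for( std::ptrdiff_t j = 1; j <= r; j++ ) {
                sum += w[ j ] * ( in[ i - j ] + in[ i + j ] );   // one multiplication for two samples
             }
             out[ i ] = sum;
          }
       }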

    The framework is all new code, but it follows the concepts and ideas of the original DIPlib code. The line filter is slightly modified from the original code; the biggest difference is that the switch statement used to be within the loop over the pixels in the line, whereas now there is one loop within each switch case.

    Maybe I’ll write a more detailed description some day.

  3. On October 23rd, 2017, at 8:31, Royi said:

    Hi,
    It will be really nice to have a post on the Gaussian Blur.

    I wrote a function for Separable Convolution (See https://codereview.stackexchange.com/questions/172157).
    From my experience, you are better off not copying to a temporary array, but dealing with the boundaries specifically.

    I used the classic method of reading rows and writing columns.
    I wonder if, according to what you say, on modern CPUs one should do it differently.

    Thank You.

  4. On October 23rd, 2017, at 21:04, Cris said:

    Royi,

    When computing the convolution, you read data from the same line many times (especially for larger kernels). If these data are contiguous in memory, this process uses the cache more efficiently. Thus, it is advantageous to do one of two things: read rows and write columns, or copy the line into a temporary buffer.

    I think the timing differences I observed with reading rows and writing columns are because you can’t do the operation in place: you cannot process one line of the image and write it back into the same memory, you need a new memory block. For a 2D image, you need both a temporary intermediate image and an output image. For higher dimensionalities, you need two temporary intermediate images. If you don’t do the transposition, you can read and write into the same image. And writing into the same image means less pressure on the cache, especially for larger images.

    And sure, not copying lines saves some time, but if you otherwise need to implement all the different possible boundary extensions in every filter you write, it is worthwhile to pay the small extra cost of copying the data and keep each of the filter functions as simple as possible. Plus, to work in place you need to copy the line anyway.
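
    As a concrete illustration (again just a sketch, not the framework’s actual code), copying a line into a buffer with, say, mirrored boundary extension takes only a handful of lines, and every filter that then works on the buffer can ignore the image edge entirely:

       #include <cstddef>
       #include <vector>

       // Sketch: copy one image line of length n into a buffer extended by
       // `border` pixels on each side, using mirrored boundary extension.
       // Element border + i of the buffer corresponds to line[ i ].
       // Assumes border <= n.
       std::vector< float > extendLine( float const* line, std::ptrdiff_t n, std::ptrdiff_t border ) {
          std::vector< float > buffer( static_cast< std::size_t >( n + 2 * border ));
          for( std::ptrdiff_t i = -border; i < n + border; i++ ) {
             std::ptrdiff_t j = i;
             if( j < 0 ) {
                j = -j - 1;           // mirror at the left edge: -1 -> 0, -2 -> 1, ...
             } else if( j >= n ) {
                j = 2 * n - j - 1;    // mirror at the right edge: n -> n-1, n+1 -> n-2, ...
             }
             buffer[ static_cast< std::size_t >( i + border ) ] = line[ j ];
          }
          return buffer;
       }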

    Did you measure a large time difference when copying lines of data to a temporary buffer?

  5. On October 24th, 2017, at 11:25, Cris said:

    I was just pointed to this particular commit to the GNU C library. Someone deleted the hand-written assembly for the logf function, reinstating the plain old C source code for it. On some architectures, latency was improved by 80% and throughput by 240%! This is a good example of outdated optimizations: the assembly might have been efficient 20 years ago, but it’s not optimal for modern platforms. Let the compiler generate the assembly, so that when architectures change, you don’t need to rewrite your software.

    Also, it’s clear that compilers are much better at writing assembly than we are.
