Run in parallel? Or not?

Posted on Updated on

Though hidden in the source code and nowhere explicitly documented, TopoToolbox uses some MATLAB built-in parallelization techniques. The functions gradient8, curvature, and hillshade for example, use the function blockproc that comes with the Image Processing Toolbox. If you have the Parallel Processing Toolbox, blockproc will make use of several workers to process your data. This should be more efficient and faster, at least in theory.

In practice, and despite being embarassingly parallel, the parallel approach might not be as efficient. But it took me some time to realize this, partly due to MATLAB bugs, partly due to my own mistake.

Here are some issues that I came across:

1. Until release 2015b, blockproc was bugged (see bug report 1157095). Enormous memory requirements slowed down execution. The Mathworks offered a work-around, that solved this bug, but memory requirements were still high.

2. These memory requirements were due to a silly mistake that I made when writing the anonymous function that blockproc was about to process. Here is a code snippet from gradient8 that illustrates the problem:

fun = @(x) steepestgradient(x,G.cellsize,c);
G.Z = blockproc(DEM.Z,blksiz,fun,...
'BorderSize',[1 1],...

The steepestgradient function is a subfunction of gradient8 that calculates the gradient for each block. G is a GRIDobj that I preallocated. Now, what’s wrong. Well, I passed G.cellsize to the anonymous function and I guess that this caused that the entire GRIDobj was transferred to each worker thus creating a large memory overhead and a huge amount of transfer time between the workers.

I have now changed this code snippet to

csĀ  = G.cellsize;
fun = @(x) steepestgradient(x,cs,c);
G.Z = blockproc(DEM.Z,blksiz,fun,...
'BorderSize',[1 1],...

And this resolves the problem.

3. Now, is it really worth the effort to call blockproc in parallel mode. I tried a simple testcase. I used a relatively large DEM with 12477 rows and 21275 columns stored in single precision, and I determined the block size using the function bestblk.

blksiz = bestblk(size(DEM.Z),5000);

I then used tic and toc to time execution: With blockproc running in parallel, it took gradient8 13.03 seconds to return the gradient GRIDobj. Choosing a best block size of 2000 increased run time to 33.8 seconds. Without invoking blockproc, gradient8 took only 8.23 seconds. What’s the reason for this? My guess is that imerode, the function called by the function steepestgradient twice, has some inbuilt parallelism so that it just doesn’t benefit from additional parallelization. Observing physical memory usage using the task manager revealed that none of the approaches resulted in massive memory overhead.

Now let’s try another function: hillshade. Hillshade calls the function surfnorm which returns three grids same size as the DEM. I thus expect some memory issues when calculating surfnorm of a very big matrix that should be decreased by using blockproc. When I run this function without blockproc, I see a massive increase in memory demand up to 100% of my 24Gb available RAM. CPU usage goes down, probably because MATLAB is forced to swap data between the hard disk and the main memory. The elapsed time in this case is 161.2 seconds. And surprise, when invoking blockproc (block size 5000), the memory overhead of the function hillshade goes drastically down and the elapsed time decreases to 66.7 seconds. Smaller block sizes of 2000 and 500 increase run time to 89.1 and 104.4 seconds, respectively.

Now what did I learn. (i) Be aware of the amount of data that you transfer between the workers. (ii) The benefit of using blockproc depends very much on the algorithm that will be processed by each worker. Memory demanding algorithms benefit from using parallelization. However, that benefit highly depends on your computational resources and (iii) is modulated by the size of the blocks that you transfer to the workers.

I have now updated some GRIDobj methods that call blockproc and removed the bug that caused excessive data transfer between the workers. This works well on my system, but perhaps it won’t on yours. My recommendation is that you best tinker yourself with the options of when blockproc should be involved and, if, which block sizes should be used.

Do you have experience with using blockproc. Please let me know.