Previously I discussed the importance of build time for developer productivity and how parallelism and optimization flags can significantly improve build performance. Today I will examine other techniques in the context of a motivating example from modelE.

 

One of the developers pointed out that for the chemistry+aerosol configuration of the model, a single file was taking a disproportionate amount of time to compile. Worse, due to various dependencies, the majority of changes made to the model ultimately force recompilation of this file. I decided to investigate what could be done to ameliorate the situation.

 

The first step, of course, is to instrument the build so that we have some hard data to analyze. Since I wanted to do this a bit more systematically than in the past, I finally read the manual for the Unix "time" command so that I could produce a clean profile without the hand editing I have resorted to previously. I settled on the following lines in the Makefile:

COMPILATION_PERF = ~/compilationPerf.dat

# Record "elapsed seconds, source file" for every compile.
%.o: %.f
	/usr/bin/time -f "%e, $<" -a -o $(COMPILATION_PERF) $(F90) ...
 
It is important to give the full path to the time command because some shells provide a built-in time command with a different syntax. The build now appends one line to the .dat file for each file that is compiled: the elapsed time (%e) in seconds and the file name, separated by a comma for easy import into a spreadsheet.
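Because the log is plain comma-separated text, a small helper target can list the slowest compiles first. The following is only a sketch; the compile-report target name is my own invention, and it assumes GNU sort is available:

# Hypothetical convenience target: list compiles from slowest to fastest.
# Each line of the log has the form "<elapsed seconds>, <source file>".
.PHONY: compile-report
compile-report:
	sort -t, -k1,1 -g -r $(COMPILATION_PERF)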

Baseline Performance

 

With the Makefile changes in place, I made a fresh build of the model for the desired configuration and sorted the results:
File                      Compilation Time (sec)
TRACERS_DRV.f                    155.58
TRCHEM_master.f                   28.73
DIAG.f                            25.70
DEFACC.f                          20.47
RAD_DRV.f                         18.56
RADIATION.f                       18.50
CLOUDS2.f                         13.39
CLOUDS2_DRV.f                      8.03
GHY.f                              7.14
DIAG_PRT.f                         6.89
Total (139 files)                427.65
Thus, just one file, TRACERS_DRV.f, was taking over 35% of the compilation time!

Drilling deeper

Taking a closer look at the expensive file, I found that it was just a collection of independent external subroutines. Since there is no containing F90 module, it is a simple matter to split the file into ~10 separate files with one procedure in each, and then re-instrument. I found the following:
Procedure                 Compilation Time (sec)
INIT_IJTS_DIAG                    76.71
INIT_JLS_DIAG                     39.09
INIT_TRACER_CONS_DIAG             18.42
TRACER_IC                         13.88
INIT_TRACER                        2.66
SET_TRACER_2DSOURCE                2.01
TRACER_3DSOURCE                    1.98
SET_DIAG_RAD                       0.36
GET_COND_FACTOR                    0.31
DAILY_TRACER                       0.23
INIT_IJLTS_DIAG                    0.20
GET_LATLON_MASK                    0.17
Total                            160.92
We can make several interesting observations from this data. First, unsurprisingly, a small minority of the procedures consumes the bulk of the compilation time. Second, simply by splitting the procedures into separate files, parallel compilation could in theory reduce the build time for this set of procedures by about 50%: no matter how many compile jobs run simultaneously, the elapsed time cannot drop below the 76.71 seconds needed for the slowest procedure, which is roughly half of the 160.92-second total. We could have hoped for more. And finally, from the names, we might guess that the first 5 procedures in the list are only involved during the initialization phase. (Not entirely obvious for TRACER_IC, but it too is an initialization procedure.)
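For reference, exploiting that parallelism requires nothing beyond GNU make's usual jobs flag; a typical invocation might be:

make -j 8     # allow up to 8 compile jobs to run concurrently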
Why does it matter that the expensive procedures are "only" initialization procedures? The performance of initialization procedures is unimportant for practical purposes, since they are executed just once, whereas most procedures are executed once per iteration (or more). As it turns out, the amount of computational work in these procedures is quite small anyway, with virtually no 3D loops - just lots and lots of logic.

Easing off on optimization

Thus, if we simply crank the optimization for these top 5 offenders down to -O0, we might significantly reduce the compilation time with virtually no consequence for run-time performance. And indeed we see the following:
Procedure                 Compilation Time (sec)
INIT_IJTS_DIAG                     0.36
INIT_JLS_DIAG                      0.29
INIT_TRACER_CONS_DIAG              0.24
TRACER_IC                          1.16
INIT_TRACER                        0.22
As these results show, the performance bottleneck is completely obliterated by this minor change. As it turns out, -O0 can actually be applied to the original, unsplit file and still preserve results that are bitwise identical to the fully optimized version. Although applying lower optimization to the non-initialization procedures is perhaps suboptimal, this avoids introducing new files into the repository and various other complicating factors.
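For concreteness, a per-file override of this sort can be expressed with a GNU make target-specific variable. The line below is only a sketch; the variable that actually carries the optimization flags in the modelE build may not be named FFLAGS as assumed here:

# Sketch: strip any -O<n> from the usual flags and pin this one file to -O0.
TRACERS_DRV.o: FFLAGS := $(filter-out -O%,$(FFLAGS)) -O0

A comment alongside such a line is also a natural place to record why the optimization level was lowered.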

Analysis

But why were those routines taking so much time to compile? If we look closely at their structure, these procedures largely consist of large SELECT CASE blocks (tens to hundreds of CASE statements) embedded within a loop. By commenting out various sections of code, I was able to show that the compilation time grows superlinearly with the number of CASE blocks; splitting the original SELECT CASE into two separate statements, however, had no impact. My theory is that because the variables accessed in these structures are largely compile-time parameters, including the loop bounds, the compiler sees a great deal of potential to completely unroll everything while looking for opportunities to optimize. Intuitively I can see that no real simplification is possible, but the compiler is trying very hard.
Many of these bits of code will soon be replaced by a more general algorithm based upon associative arrays. That change will effectively eliminate this compilation bottleneck, though that was not the intended purpose.

Consequences?

And what impact does the reduced optimization have on the overall execution performance of the model? The baseline configuration required 138 seconds per simulated day. The unoptimized configuration required only 136.1 seconds - a hair faster. (No, we did not speed up the code; we have merely demonstrated that the noise in the performance measurements is larger than the effect of the change.)

Conclusions

 

No doubt a wide variety of things can produce compilation bottlenecks. However, as this article shows, one does not need to be an expert in compiler implementation to tackle these challenges successfully. Rather, a simple, systematic exploration of the situation will often lead to simple solutions.

 

For the particular case discussed here, I am a bit dissatisfied with the use of optimization flags as a permanent solution, because future developers may inadvertently (or even purposefully!) alter the compilation flags for this file. I can try to protect against this with a comment in the Makefile, but a better solution would be a structural change to the code that eliminates the bottleneck while also improving other aspects of software quality. In this particular case such a change is already planned, though only coincidentally from the perspective of this blog entry.