
(Apologies to readers who are not heavily steeped in Star Wars trivia.)


As with the old joke about duct tape, CPP is like The Force: it has a light side and a dark side, and it holds the universe (here, the application) together.  The vast majority of large Fortran applications rely on some form of conditional compilation to specify configuration options, and some, such as GISS modelE, rely on it so heavily that it can fairly be called part of the overall software architecture.   Conditional compilation, usually via some sort of source-code preprocessing, can yield a wide variety of benefits for scientific models, but it also has significant negative consequences that developers should be aware of.  As I have emphasized for many other topics in this series, scientific developers are often so immersed in existing practice that we fail to see the alternatives and, more importantly, the consequences of our implicit choice to continue such practices.


By far the most common form of conditional compilation in the Fortran community is the use of the C preprocessor (CPP), which provides nested if-else blocks that effectively mask entire blocks of source code from the compiler.   However, other forms of conditional compilation are also in use, such as m4 and sed, as well as various Fortran-specific preprocessors such as COCO and Forpedo, which are largely unused in our community, presumably for historical reasons.   A more subtle form of conditional compilation can appear in build scripts (e.g. a Makefile), where some mechanism controls which files are and are not compiled.   GISS modelE uses a combination of CPP and Makefile controls to support a very large variety of configuration options.


Note that while CPP also provides macro expansion capabilities, which have their own upsides and downsides, this article addresses only the use of CPP for conditional compilation.



Rationale for conditional compilation

Because so many groups have independently decided to incorporate conditional compilation in their software, we can safely conclude that compelling reasons exist for this choice, or at least once did.   Before moving on to the negative consequences, we should remind ourselves of these positive motivations.  Here I briefly summarize the rationale for the most frequent uses.



Portability

Although Fortran 90 is itself very portable, many software applications must interface with external resources that vary from platform to platform. For instance, many compiler vendors provide optimized FFT and other numerical procedures with unique interfaces.   Fortunately, the growth of standardized interfaces (e.g. LAPACK, MPI) and publicly available portable packages (e.g. FFTW, ESMF) has significantly eroded the need to customize Fortran applications for a given platform.   Nonetheless, when external interfaces are nonstandard, conditional compilation is a very defensible mechanism for minimizing complexity in an application.   Such issues can generally be confined to isolated, well-documented portions of the software.



Run-time performance

Most scientific models are intended to be usable for a variety of scenarios, some of which are significantly more computationally expensive than others.   When the differences between scenarios can be expressed at a high level in the software, a simple run-time conditional (i.e. a vanilla Fortran "if" statement) can be very effective.   However, when the differences unavoidably appear in the inner loops of the implementation, run-time conditionals may incur an unacceptable performance overhead.   In these instances, conditional compilation provides much of the same flexibility as a run-time conditional, but at far lower cost to performance.


It is worth noting that compilation performance itself can also be improved by means of conditional compilation.



Memory footprint

As with performance, some model scenarios may use far more memory than others.  In the extreme, there may be scenarios with mutually exclusive data structures, such that no one scenario needs as much memory as the superset of all possible combinations.    Conditional compilation can be used to reduce the memory footprint of an application when it runs a scenario requiring less memory than would be needed to support all scenarios simultaneously.   Although this use of conditional compilation was quite defensible in early versions of Fortran, the practice must now be weighed against dynamic memory allocation, which was introduced in Fortran 90.
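The dynamic-allocation alternative can be sketched briefly. The example below is in C (the analogue of a Fortran 90 allocatable array), and the names and sizes are purely illustrative, not taken from any actual model:

```c
#include <stdlib.h>

/* A sketch of the dynamic-allocation alternative to #ifdef'd static
 * arrays: size the workspace from the scenario's run-time
 * configuration, so one executable never pays for fields a scenario
 * does not use.  All names here are illustrative. */
double *alloc_tracer_workspace(int n_tracers, int n_cells)
{
    if (n_tracers == 0)
        return NULL;                   /* scenario uses no tracers */
    return calloc((size_t)n_tracers * (size_t)n_cells, sizeof(double));
}
```

A single executable can then serve both tracer and tracer-free scenarios, with the memory decision deferred to run time.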


Mutually Exclusive Implementation Alternatives

Quite commonly a model may provide alternative implementations of a given piece of functionality, where no sensible meaning attaches to using more than one implementation in a given run.   For instance, an atmospheric model might offer two or more dynamical cores, but only one can be used in a given execution of the model.    With conditional compilation, all of the alternatives can be implemented with identical interfaces, hiding this complexity from other portions of the model.     Modern capabilities such as frameworks (e.g. ESMF) and object-oriented language features can provide similar reductions in complexity and should be considered before choosing conditional compilation for this purpose.
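The pattern is easy to sketch. The example below is in C, since the CPP directives are the same ones used in preprocessed Fortran; USE_SPECTRAL_CORE and the core names are invented for illustration:

```c
#include <string.h>

/* Hypothetical sketch: two mutually exclusive dynamical cores share
 * one interface, and a compile-time flag selects which body the
 * compiler ever sees.  Names are illustrative, not from any model. */
#ifdef USE_SPECTRAL_CORE
const char *active_core(void) { return "spectral"; }
#else
const char *active_core(void) { return "finite-volume"; }
#endif

/* The rest of the model calls active_core() (or routines built on it)
 * without knowing which alternative was compiled in. */
```

The price, of course, is that whichever branch is not selected is invisible to the compiler for that build.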


Toggle for debugging/diagnostics

When an application breaks or produces unexpected results, developers often activate machinery that produces additional diagnostic data to help resolve the issue.   These extra data would be prohibitively expensive or distracting in an ordinary run, so they are deactivated with conditional compilation. This use of conditional compilation is often expected to be temporary and/or intended only for the primary developer of that section of the code.
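A minimal sketch of such a toggle, again in C with invented names (DEBUG_TRACERS, advance_tracer); the "physics" is a stand-in:

```c
#include <stdio.h>

/* Sketch of a temporary debugging toggle: the extra output is compiled
 * in only when DEBUG_TRACERS is defined (e.g. via -DDEBUG_TRACERS on
 * the compile line).  The update itself is a stand-in. */
double advance_tracer(double q)
{
    q = 0.5 * q;                       /* stand-in for the real step */
#ifdef DEBUG_TRACERS
    /* Invisible to the compiler in ordinary builds, which is both the
     * benefit (zero run-time cost) and the risk (dead code can rot). */
    fprintf(stderr, "tracer after step: %g\n", q);
#endif
    return q;
}
```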



Negative consequences

As with many good things, conditional compilation can be just fine in moderation.   The first bit of CPP that crept into a given model was probably very beneficial and paved the way for additional uses.   At some threshold, however, the indirect costs begin to rival the presumed benefits, but these costs are not immediately recognized.  It is unsurprising that most teams overshoot a reasonable balance, and turning back the clock is no trivial matter.  First let us examine some of the undesirable consequences of conditional compilation in isolation.


Limited code coverage

The primary problem with conditional compilation is the effective volume of "dead code" not seen by the compiler.  This fairly obvious consequence has numerous aspects.    Perhaps the most important is the increased difficulty of making correct changes while extending or maintaining an application.   If a change breaks code that is not being compiled, significant errors will go undetected until someone builds a configuration that uses those blocks of code.   Of course, code that is compiled but never executed carries similar risks, so conditional compilation technically only exacerbates the situation.   Developers can choose to live life on the edge and hope that induced problems in dead code are minor, easily fixed, and not traced back to them.  Alternatively, they can be more cautious, at the expense of compiling and running multiple configurations of the model to ensure that all blocks of code are compiled and produce expected results.   The severity of this problem rises sharply with the volume of "unused" code, and especially with the number of independent configurations required to cover all of the source code.


In an ideal world, we could compile all of the source code into one single executable and use different run-time options to ensure that all of the model is tested.  This might still be a complex process for full-system tests, but it would be much faster than the potentially exponential number of compilations.   And with a healthy set of unit tests, testing all of the components could ultimately be simple and fast.   But so long as conditional compilation is heavily used, it is very difficult to approach this ideal in a systematic fashion.



Undecipherable code

The occasional short CPP conditional block does not generally impair understanding of a section of code.   However, when a block becomes large (spanning more than one screen) or the conditionals become nested, the code can become every bit as difficult to follow as traditional spaghetti code.   In extreme cases, it can take serious concentration even to determine whether a given line of code is compiled for a given set of compilation settings.  In principle this problem is not much more severe than the analogous problems of long procedures and deeply nested conditionals.   However, the usual rules for indentation of CPP conditionals do not even provide the visual cues that help us follow an algorithm.
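A contrived illustration of the problem, in C with invented flags (OPT_A, OPT_B, OPT_C); working out which branch survives a given set of flags already takes some thought even in this tiny example:

```c
/* Contrived illustration of nested CPP conditionals.  Quick: which
 * return statement survives for -DOPT_B -DOPT_C?  (It is the third:
 * OPT_A is undefined, so the #elif on OPT_C wins.)  Indentation of
 * the directives offers little of the usual visual help. */
int which_branch(void)
{
#if defined(OPT_A)
#  if !defined(OPT_B) || defined(OPT_C)
    return 1;
#  else
    return 2;
#  endif
#elif defined(OPT_C)
    return 3;
#else
    return 4;                          /* all flags undefined */
#endif
}
```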


DRY  (Don't Repeat Yourself)

Although code duplication is not a direct consequence of using CPP, many developers seem especially prone to it when working with dense CPP conditionals.   Duplicated logic, whether conventional Fortran or CPP, is an unnecessary maintenance burden and typically makes code harder to follow than properly encapsulated logic.



Remedies

If we want to reduce our reliance upon conditional compilation, are there steps we can take?   Certainly.   Although complete elimination would be very difficult and is unwarranted, persistent attention to the problem can bring a model back to a sensible level of conditional compilation.   Many of the same techniques used to reduce complexity in standard Fortran can be applied to preprocessors as well.  How any specific usage should be addressed largely hinges upon its underlying rationale.



Portability

As mentioned above, there is sometimes no choice but to use conditional compilation to deal with different computing environments.  Even then, proper engineering of interfaces should make it possible to restrict conditional compilation to very isolated (and well documented!) sections of the code.   Developers should look for every opportunity to use standardized interfaces and portable third-party libraries.    When these are not available, developers should attempt to isolate the nonstandard functionality behind as few interfaces as possible.   Conditional compilation is then restricted to just those interfaces.   If those interfaces are implemented for each environment in a separate set of files, the makefile itself can cleanly manage the conditional compilation and the preprocessor disappears entirely.
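A sketch of the isolation idea, in C; HAVE_VENDOR_BLAS and vendor_dsum() are invented names standing in for a nonstandard vendor routine:

```c
/* Hypothetical sketch of confining nonstandard functionality to one
 * interface: only this translation unit contains the conditional, and
 * callers everywhere else see only model_sum().  HAVE_VENDOR_BLAS and
 * vendor_dsum() are invented names. */
double model_sum(const double *x, int n)
{
#ifdef HAVE_VENDOR_BLAS
    return vendor_dsum(n, x);          /* nonstandard vendor routine */
#else
    double s = 0.0;                    /* portable fallback */
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
#endif
}
```

A cleaner variant puts each branch in its own file and lets the makefile choose which file to compile, so the preprocessor drops out of the source altogether.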


Note that we are generally not concerned about code-coverage when talking about portability.



Performance

When performance matters, there may well be no good alternative to conditional compilation.  Still, developers should verify that performance really is impacted by the alternatives rather than simply assuming that the consequences are unacceptable.   One strategy for reducing conditional compilation used for performance is to pull the conditional up to a higher level by introducing duplication.   For example, a triply nested loop with a CPP conditional inside could be written as two triply nested loops, with the conditional controlling which loop nest executes.   At some level, the cost of switching to a run-time conditional (i.e. a standard Fortran "if-else" block) becomes negligible, and the CPP can be eliminated.   This approach works best when there are really only two configurations in the critical section of code and the duplicated section is relatively short.   Developers will need to weigh the cost of maintaining duplicate code sections against the advantages of simpler logic and better code coverage.
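The hoisting strategy can be sketched as follows, in C; the flag name and the 0.9/0.5 updates are stand-ins for real alternative schemes:

```c
/* Sketch of pulling the conditional up: rather than an #ifdef inside
 * the innermost loop, duplicate the loop nest and branch once at run
 * time.  use_fancy_scheme and the numeric updates are stand-ins. */
void relax(double *a, int nx, int ny, int nz, int use_fancy_scheme)
{
    if (use_fancy_scheme) {
        for (int k = 0; k < nz; ++k)
            for (int j = 0; j < ny; ++j)
                for (int i = 0; i < nx; ++i)
                    a[(k * ny + j) * nx + i] *= 0.9;
    } else {
        for (int k = 0; k < nz; ++k)
            for (int j = 0; j < ny; ++j)
                for (int i = 0; i < nx; ++i)
                    a[(k * ny + j) * nx + i] *= 0.5;
    }
}
```

The single run-time branch is executed once per call rather than once per inner iteration, so its cost is negligible, and both loop nests are compiled in every build.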



Memory

Of course, the large memories on modern computers often make a mockery of concerns from earlier times, and the rationale for conditional compilation to conserve memory may evaporate on its own.   If not, propagating dynamic allocation throughout the entire model is the recommended approach, though it may require significant effort.   In our community, domain decomposition for parallel computing has largely forced the use of dynamic allocation already.



Mutually exclusive implementations

A number of mechanisms are now available to developers to avoid using conditional compilation to manage configurations.   One challenge is dealing with namespace collisions when multiple options use the same name for analogous procedures and/or modules.     If the configurations can be encapsulated behind a small number of well-defined interfaces, then run-time conditionals in the driver can be quite effective.   If the interfaces are more complex, frameworks (e.g. ESMF) can enable seamless run-time configuration.   And, of course, the object-oriented features of Fortran 2003 now provide a simple mechanism for hiding configuration logic from other portions of the model.
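One simple run-time mechanism can be sketched in C with function pointers (the analogue of procedure pointers or polymorphic objects in modern Fortran); all names are illustrative:

```c
#include <string.h>

/* Sketch of run-time selection replacing CPP: both alternatives are
 * always compiled, distinct name prefixes avoid collisions, and a
 * run-time parameter (in practice read from a rundeck or namelist)
 * picks the implementation.  All names here are illustrative. */
static double spectral_step(double u)      { return 0.9 * u; }
static double finite_volume_step(double u) { return 0.8 * u; }

typedef double (*core_fn)(double);

core_fn select_core(const char *name)
{
    if (strcmp(name, "spectral") == 0)
        return spectral_step;
    return finite_volume_step;         /* default core */
}
```

Because every alternative is compiled into every build, a single executable can cover all configurations, which directly addresses the code-coverage problem described above.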


Optional Diagnostics

At the very least, this use of conditional compilation should be supplemented by moving the diagnostic logic into a subroutine.   CPP can then be used to control whether that subroutine has any content.   Even so, developers should consider using a run-time switch instead, so that the diagnostic logic continues to compile when the data structures it references are modified by other developers.


Although conditional compilation is here to stay, I believe that a concerted effort to limit its use is an essential element of a long-term strategy for producing maintainable, well-tested software.



Many scientific models are effectively developed as a team of individuals.  By this I mean that each contributor uses their own personal coding style rather than a common, agreed-upon set of conventions.    This lack of convention can become an impediment to productivity when many files are developed by multiple individuals with clashing styles.     When style is consistent and predictable, code is easier to read and understand, and new developers become productive sooner. Further, in some instances, consistent style can even help highlight certain types of bugs.


A common practice in the software industry is to develop a coding-standards document that establishes conventions for variable naming, indentation and spacing, documentation requirements, and so on.   Such documents present an opportunity to push for best practices, but should also be sensitive to existing conventions: the changes must be worthwhile.    Of course, everyone must be willing to compromise on personal preferences for the sake of the team.   The good news is that virtually all developers rapidly become accustomed to a new style, with only a short (1-2 week) period of awkward grumbling.


For scientific modelers, some aspects of establishing a coding standard deserve a different emphasis.   First, scientists are not generally interchangeable to the degree that programmers are.   For example, the atmospheric chemistry component of a climate model is generally developed and maintained by a small team (possibly one individual) with expertise in atmospheric chemistry, and the radiation experts should have relatively little reason to modify the chemistry files.  This is in stark contrast to the commercial software industry, where "domain experts" are frequently involved in collecting requirements but not in the actual coding.    In practice, of course, development of scientific models tends to "leak", and changes that would be local under ideal circumstances often must propagate across multiple parts of the model.     Nonetheless, this strong sense of individual ownership is an understandable obstacle to creating a common set of conventions.

Another important concern is that the majority of scientific software developers (i.e. scientists) have had relatively little exposure to modern best practices from the software industry.    Some elements of a well-intentioned coding-standards document may therefore seem arbitrary or even counterproductive when they conflict with other weaknesses in the current implementation.    For example, many commercial developers recommend indenting blocks of code by four spaces, but a typical science code may have regions indented more than six levels deep.   Until such structural problems are remedied, a smaller indentation rule may be appropriate for a science code.    Similarly, naming conventions suggesting longer, self-explanatory names may be unconvincing in a community using variables that were named under ancient versions of Fortran (which limited names to six characters) and that uses older editors lacking modern conveniences such as name completion.


Perhaps the largest concern of scientist-programmers that I have encountered is simply that they will be judged by their degree of conformance with any new standard.   Top-down rigid enforcement of new conventions is unlikely to succeed in the semi-academic environment in which such models are developed.   Instead, the standard should be held out as an ideal to be approached gradually.   Initial implementation priority can be placed on the subset of conventions that are most important by a variety of criteria.


In the case of GISS modelE, initial priority is being driven by a few issues that either have an immediate necessity or are difficult to reverse and should therefore be done "right" the first time.   An example of the former is that modelE will be switching from fixed-format to free-format source code to support refactoring under Eclipse/Photran.     An example of the latter is the requirement that file names correlate with file contents (e.g. module "Foo_mod" should be in file "Foo_mod.F90").   Because renaming files under CVS is somewhat awkward, establishing and following a naming convention for any new files (and modules/procedures) is very important.   To compensate for the pain of these transitions, the remaining near-term priorities are conventions that are already generally followed to some degree.


Once the modelE coding standards document is complete and ratified by its developer community, I will post it on Modeling Guru.


Trusting the Software

Posted by Thomas Clune Nov 30, 2009

When we set out to modify software, we immediately face the possibility that we have introduced bugs (or, less colloquially, "defects").   Of course, it would be nice if bugs announced their presence, and sometimes they do, in the form of segmentation faults or ridiculous values for physical quantities.  Even then, the actual bug may be only indirectly related to the symptoms. But all too often bugs are more subtle in their consequences, and in the worst cases appear as minor deviations in the underlying physical model.


I want to draw a very strong distinction here between testing scientific models from a strictly software vantage point and scientific validation.   Some scientific models, perhaps most notably weather forecasting systems, are extremely well validated.   On a daily basis, millions of observations from ground-, air-, and space-based instruments are compared against model forecasts and immediately alert modelers to unacceptable deviations.   The quality of weather forecasting today is a testament to these continued validation efforts.   When we test the software, however, we are looking for places where the code does not execute in the manner intended by the implementor.   Many software defects will of course also impact scientific validation, yet a code with absolutely no defects can nonetheless produce terrible scientific results.    For most complex models the situation is alarmingly murky.    Big bugs (the ones that cause the code to crash or produce ridiculous answers) are mostly eradicated quickly, but more subtle bugs can hide within the variability of the full system.   In fact, the tuning of physical parameters for one subsystem can sometimes unintentionally (or intentionally!) mitigate bugs in other subsystems.   The irony here is that a better implementation of a given subsystem may actually be rejected because it degrades the overall scientific accuracy of a model that has been tuned to the inferior implementation.


Most science models that I have worked with have extremely little formal testing associated with their development and maintenance.   In the more sophisticated models, some form of full-system testing is done with some degree of regularity or automation.  Even then, the automated full-system tests are limited to "it ran to completion", "it got the same results as yesterday", and "the results are identical on varying numbers of processors".    As a consequence, bugs are often discovered long after they are introduced and are typically more difficult to fix, as the offending piece of code is less familiar and perhaps more solidly "embedded" in the source.    (I assume that routine tests of operational models, such as the national weather forecast, are more sophisticated, but I have no direct experience with such operational groups; in this article I am therefore addressing models in more academic/research environments.   Even in operational settings, though, I would not be too surprised if automated testing is extremely limited, with intensive manual "spot-checking" as the primary safety mechanism.)


Can we do better?


Assuming for the moment that there is substantial value in more thorough testing of our software, why is the state of testing in our community so rudimentary?    No doubt one big reason is simply that the full value is not recognized by the various individuals involved in the process.    And even when the value is appreciated, robust implementation can be alarmingly difficult in my experience.   Consider some of the following issues:

Compare to what?

Perhaps the most daunting challenge to robust system testing of a complex model is the lack of a reference solution to compare against.   Comparing against results from an earlier build provides some assurance, but is of limited use because many intentional physical changes to the model will alter the results to some degree.   Specifying even a very loose tolerance or error bound for variation under such changes is extremely difficult in practice.    We can, however, vary some parameter and check that the results vary in the expected manner.  For example, some message-passing implementations have been carefully designed to produce identical results independent of the number of processors (strong reproducibility); testing for such reproducibility verifies a relatively orthogonal aspect of the model.      In a similar manner, one could test a checkpoint-restart capability that is designed to give results identical to a run without a restart.

How to compare?

Even with reference solutions, and especially in their absence, effective system testing requires a robust, precise mechanism for comparing two model solutions.   Unfortunately, many groups rely on text-based "log files", which typically have limited floating-point precision and represent only a fraction of the full state of the simulation.   Such a mechanism can be worse than no mechanism at all, because it provides a false sense of security when it fails to detect variations.   Developers in my organization recently encountered precisely this problem (working on an application from another group) and immediately stopped operations to create a more robust comparison tool.  They then had to backtrack through their work to find the true point of origin of a bug that had been introduced.   (Fortunately the differences became large under certain conditions, or the problem might not have been detected until far later.)    Another important property of a good comparison tool is that it not be subtle about the issues it detects: if error messages blend in with a flurry of otherwise harmless diagnostic messages, developers may not notice the problems. An example of a very good comparison tool is the one used by the developers of GISS modelE.   By default that tool reports any bit differences, identifying the specific quantities that vary and the size of the discrepancy; when all arrays are identical, the output is only a few lines.
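The essence of such a tool can be sketched briefly. This is only a minimal illustration in the spirit of the modelE tool, not its actual implementation:

```c
#include <math.h>
#include <stdio.h>

/* Minimal sketch of a state-comparison tool: report every element
 * that differs exactly (no tolerance), track the largest absolute
 * discrepancy, and stay quiet when the states match, so that any
 * output at all signals a problem. */
int compare_states(const double *a, const double *b, int n,
                   double *max_diff)
{
    int ndiff = 0;
    *max_diff = 0.0;
    for (int i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            double d = fabs(a[i] - b[i]);
            if (d > *max_diff)
                *max_diff = d;
            printf("element %d differs: %.17g vs %.17g\n",
                   i, a[i], b[i]);
            ++ndiff;
        }
    }
    return ndiff;                      /* 0 means identical states */
}
```

A real tool would of course traverse the full checkpoint state by named quantity rather than a flat array, but the key properties (full precision, exact comparison, silence on success) are the same.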

How to test everything?

Ideally, our system tests should involve most if not all of the source code used in the production version of the model.   Unfortunately, this is rather difficult to achieve for most models, for a number of reasons.


As alluded to above, complex scientific models often have numerous compile-time configuration options.  Typically this means that a rather large number of separate compilations and executions are required to exercise most of the source code; in the worst cases the number grows exponentially with the number of configuration options.   In practice, testing is therefore limited to a small number of configurations that are thought to be representative.   In the absence of test-coverage tools for Fortran, the full impact of this limitation can be difficult to measure.   A later article will discuss the various pains associated with compile-time configuration and techniques to mitigate them.     In the case of GISS modelE, there is a desire to maintain serial, OpenMP, and MPI implementations across a wide variety of scientific configuration options (different oceans, dynamical cores, tracers, etc.).    Each configuration requires full recompilation and execution, and full coverage of all important configurations could easily require thousands of independent compilations.


Another related issue is that certain software components may run only at sporadic time intervals.   For example, some diagnostic computations in GISS modelE occur on daily and monthly boundaries, so a full system test would need to integrate for a full month before those portions are exercised.     Similarly, the checkpoint-restart mechanism requires longer (and multiple) runs to test its functionality.     From this vantage point, separate drivers that exercise just those subsystems would be a useful investment.  For such separate drivers to be valid, we must be confident that the functionality under test is orthogonal to other aspects of the implementation.


We may also wish to improve coverage in a rather different sense by varying compiler options.   For example, we may want to test with default optimizations to verify behavior in the configuration used for research, but we may also want to compile with full debugging options to leverage the compiler's ability to detect problems that are otherwise often missed, such as illegal memory references and uninitialized values. There are also operating-system variations to consider: the local mainframe, for example, provides a number of different Fortran compilers (versions and vendors) as well as MPI libraries.   A sensible strategy is to test against the next version of a compiler/library while the older version is used for operations, so that if an upgrade becomes necessary, the interruption is minimal.



How to isolate where the bugs are?

Full system tests have almost no ability to isolate bugs; they just tell us whether there is a problem.   Fortunately, when such tests are run frequently, the amount of software change since the last successful test is often quite small.   Ideally, system tests should be supplemented with more granular tests, i.e. subsystem and unit tests.   In addition to providing improved defect isolation, such finer-grained tests can also at least partially address the question raised above of what to compare results against: for sufficiently small procedures, one can specify the expected behavior directly.
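To illustrate how expected behavior can be specified directly for a small procedure, here is a sketch in C with plain assertions (a framework such as pFUnit plays this role for Fortran); saturation_ratio() is a made-up helper, not from any model:

```c
#include <assert.h>

/* Illustrative only: for a small, well-specified procedure, the
 * expected behavior can be written down directly.  saturation_ratio()
 * is a made-up helper; pFUnit provides this style of test for
 * Fortran, while plain assertions suffice here. */
double saturation_ratio(double e, double esat)
{
    return (esat > 0.0) ? e / esat : 0.0;   /* guard against esat = 0 */
}

void test_saturation_ratio(void)
{
    assert(saturation_ratio(1.0, 2.0) == 0.5);  /* exact in binary */
    assert(saturation_ratio(1.0, 0.0) == 0.0);  /* guarded case */
}
```

Such a test needs no reference run at all: the answer is part of the procedure's specification.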



How much will testing cost?

Routine testing requires a nontrivial amount of computing resources.   In the case of modelE, nightly tests compile 5 scientific configurations of the model for serial, OpenMP, and MPI execution, for a total of 15 compilations.   Each of these is run for 1 hour and for 1 day to test restart capabilities.  The parallel cases are also run on varying numbers of threads/processes; for MPI, the largest supported number of processors is included in the list.   Fortunately this is still small compared to the full resources used by the model, but it is certainly a cause for concern.


Next steps


One of my priorities this year while working with modelE is to introduce finer-grained, pervasive tests.   In particular, I will incorporate a unit-testing framework known as pFUnit and begin to implement representative unit tests for selected areas of the model infrastructure.   Here, by infrastructure I mean the various bits of software that manage data structures, parallelism, and other such things whose behavior can be precisely defined.   At the subsystem level, I hope to develop separate drivers that will enable more efficient testing with greater coverage.    I will also work to eliminate compile-time configuration options in favor of run-time configuration options where possible.    All of these issues will be recurring themes in later articles.
