NSF Stories

Research computing team studies supercomputer reliability

Reliability required in areas such as infectious disease modeling

Researchers running demanding computations rely on supercomputers to work efficiently, with as few software failures as possible. That is especially true for projects such as infectious disease models, which must be re-run frequently as new data become available.

Understanding why some jobs fail and what can be done to make supercomputers more reliable is the focus of a recent project led by Saurabh Bagchi, an electrical and computer engineer at Purdue University.

The project, which began almost five years ago and was supported by the National Science Foundation, analyzed data from supercomputer systems at Purdue as well as the University of Illinois at Urbana-Champaign and The University of Texas at Austin.

Among the conclusions:

  • Node-sharing, in which multiple jobs run on the same compute node, does not translate to a higher rate of job failure.
  • Memory-intensive applications can fail even before the rated memory of a node is reached, suggesting that close monitoring of applications' memory usage may be necessary (a minimal sketch of such monitoring follows this list).
  • Careful allocation and scaling up of "remote" resources (such as parallel file systems and network connections to storage systems) is important as a cluster grows in size.

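The memory finding points to a simple operational measure: watching each job's memory footprint and flagging it well before the node's rated capacity is reached. The sketch below is only an illustration of that idea, not the team's tooling; it assumes the Python psutil package, an example 192 GB node, and arbitrary threshold and polling values.

```python
"""Illustrative sketch (not the study's actual tooling): warn when a
job's resident memory approaches the node's rated capacity."""

import sys
import time

import psutil

RATED_MEMORY_BYTES = 192 * 1024**3   # example: a node rated at 192 GB
WARN_FRACTION = 0.85                 # warn well before rated memory is hit
POLL_SECONDS = 30                    # example polling interval


def watch_job(pid: int) -> None:
    """Poll a job process and print a warning if its memory use gets high."""
    proc = psutil.Process(pid)
    while proc.is_running():
        rss = proc.memory_info().rss  # resident set size, in bytes
        if rss > WARN_FRACTION * RATED_MEMORY_BYTES:
            print(f"warning: job {pid} using {rss / 1024**3:.1f} GB, "
                  f"approaching the node's rated memory")
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    watch_job(int(sys.argv[1]))  # usage: python watch_job.py <job PID>
```

In practice, site administrators would more likely wire such checks into the cluster's batch scheduler or monitoring stack rather than poll individual processes by hand.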
Bagchi says these are practical takeaways that supercomputer systems administrators can act on to make applications run more reliably on their machines.

The team will present the findings at the upcoming Dependable Systems and Networks conference, to be held virtually in June.