Automatic Checkpointing of NQS Batch Jobs on CRAY UNICOS Systems
Contribution to a conference proceedings / Contribution to a book | FZJ-2015-02614
1993
Cray User Group
Please use a persistent id in citations: http://hdl.handle.net/2128/11839
Abstract: In most UNIX systems, long-running application programs are not protected against the loss of their accumulated CPU time in the event of regular shutdowns or system crashes. In contrast, the UNICOS operating system provides a checkpoint/restart facility that makes it possible, for example, to recover NQS batch jobs after a regular system shutdown and reboot. However, there is still no function that periodically checkpoints running processes. This kind of checkpointing, which would minimize CPU time losses in the event of system crashes, is left entirely to the user. Unfortunately, most users do not bother with checkpointing. Therefore, a feature was developed at KFA that checkpoints NQS batch jobs automatically after a certain CPU time interval. The key component of this feature is a UNIX daemon that is activated together with each NQS request. We present a detailed description of the daemon and its user interface. Our experience in a production environment shows that the CPU time losses due to system crashes can be drastically reduced by this feature.