Contribution to a conference proceedings/Contribution to a book FZJ-2015-02614

Automatic Checkpointing of NQS Batch Jobs on CRAY UNICOS Systems


1993
Cray User Group

Proceedings of the Cray User Group Meeting, Spring 1993
Cray User Group Meeting, Spring 1993, Montreux, Switzerland, 29 Mar 1993 - 3 Apr 1993
Cray User Group, pp. 250-255

Abstract: On most UNIX systems, long-running application programs are not protected against the loss of their accumulated CPU time in the event of a regular shutdown or a system crash. In contrast, the UNICOS operating system provides a checkpoint/restart facility, which makes it possible, for example, to recover NQS batch jobs after a regular system shutdown and reboot. However, there is still no function that periodically checkpoints running processes. This kind of checkpointing, which would minimize CPU time losses in the event of system crashes, is left entirely to the user. Unfortunately, most users do not take care of checkpointing. Therefore, a feature was developed at KFA that checkpoints NQS batch jobs automatically after a certain CPU time interval. The key component of this feature is a UNIX daemon that is activated together with each NQS request. We present a detailed description of the daemon and its user interface. Our experience in a production environment shows that this feature drastically reduces the CPU time lost to system crashes.


Contributing Institute(s):
  1. Zentralinstitut für Angewandte Mathematik (ZAM)
  2. Jülich Supercomputing Center (JSC)
Research Program(s):
  1. 899 - without topic (POF2-899)

Database coverage:
OpenAccess


 Record created 2015-04-14, last modified 2021-01-29