lamssi_cr(7)
NAME
LAM SSI checkpoint / restart - overview of LAM's MPI checkpoint /
restart SSI modules
DESCRIPTION
The "kind" for checkpoint / restart SSI modules is "cr". Specifically, the string "cr" (without the quotes) is the prefix to use with the -ssi switch on the mpirun command line. For example:
mpirun -ssi cr blcr C my_mpi_program
- LAM/MPI can involuntarily checkpoint and restart parallel MPI jobs.
Doing so requires that LAM/MPI was compiled with thread support and
that back-end checkpointing systems are available at run-time. MPI
jobs must run with at least MPI_THREAD_SERIALIZED support. If
a job elects to run with checkpoint/restart support and an available cr
module is found, the job's thread level will automatically be promoted
to MPI_THREAD_SERIALIZED. See the User's Guide for more details.
- Checkpoint Phases
- LAM defines three phases for checkpoint / restart support in each MPI process:
- Checkpoint.
When the checkpoint request arrives, before the actual checkpoint occurs.
- Continue.
After a checkpoint has successfully completed, in the same process in which the checkpoint was invoked.
- Restart.
After a checkpoint has successfully completed, in a new / restarted process.
- The Continue and Restart phases are identical except for the process in which they are invoked: the Continue phase is invoked in the same process in which the Checkpoint phase was invoked, whereas the Restart phase is invoked only in newly restarted processes.
AVAILABLE MODULES
- LAM currently has two cr modules: blcr and self. For an MPI job to be
checkpointed and restarted, all of its MPI SSI modules must support
checkpoint/restart. Currently, this means using the crtcp RPI module,
or the gm RPI module when compiled with gm_get() support (see the
User's Guide for more details).
- blcr CR Module
- The Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer is a software system from Lawrence Berkeley National Laboratory. See the project web page for more details: http://www.nersc.gov/research/ftg/checkpoint/.
- The blcr module has one SSI parameter:
- cr_blcr_priority
blcr's default priority is 50.
- self CR Module
- The self CR module effectively allows application-level checkpointing by invoking user-specified functions at the Checkpoint, Continue, and Restart phases of LAM/MPI C/R support.
- Multiple SSI parameters are available:
- cr_self_user_prefix
Specify a string prefix for the names of the checkpoint, continue, and restart functions that LAM should invoke. That is, specifying "-ssi cr_self_user_prefix foo" means that LAM expects to find three functions at run-time: foo_checkpoint(), foo_continue(), and foo_restart(). This is a convenience parameter that can be used instead of the three parameters listed below.
- cr_self_user_checkpoint
Name of the user function to invoke during the Checkpoint phase.
- cr_self_user_continue
Name of the user function to invoke during the Continue phase.
- cr_self_user_restart
Name of the user function to invoke during the Restart phase.
- If the self module is selected and none of these parameters is specified, it will abort. Finally, the usual priority SSI parameter is also available:
- cr_self_priority
self's default priority is 25.
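- As an illustration of the user functions described above for the self module, the following is a minimal sketch of three callbacks named for the prefix "foo". Several details are assumptions and are not taken from this page: the prototype (no arguments, int return with 0 meaning success), the foo_iteration variable, and the foo_state.dat file name are all hypothetical. The User's Guide is the place to check what LAM actually requires. The functions are left non-static on the assumption that LAM looks them up by name at run-time.

```c
#include <stdio.h>

/* Hypothetical application state and state-file name (illustration only). */
static const char *foo_state_file = "foo_state.dat";
long foo_iteration = 0;

/*
 * Checkpoint phase: save application state when a checkpoint is requested.
 * ASSUMPTION: the prototype int f(void), with 0 meaning success, is a guess;
 * this page does not specify the signature LAM expects.
 */
int foo_checkpoint(void)
{
    FILE *fp = fopen(foo_state_file, "w");
    if (fp == NULL)
        return -1;

    int ok = fprintf(fp, "%ld\n", foo_iteration) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Continue phase: runs in the same process that took the checkpoint, so
 * the in-memory state is still valid and nothing needs to be reloaded.
 */
int foo_continue(void)
{
    return 0;
}

/*
 * Restart phase: runs in a newly restarted process, so the saved state
 * is read back from the file written by foo_checkpoint().
 */
int foo_restart(void)
{
    FILE *fp = fopen(foo_state_file, "r");
    if (fp == NULL)
        return -1;

    int ok = fscanf(fp, "%ld", &foo_iteration) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}
```

Following the parameter syntax described above, a program containing these functions would be started with something like "mpirun -ssi cr self -ssi cr_self_user_prefix foo C my_mpi_program". The three names could equally be given one at a time through cr_self_user_checkpoint, cr_self_user_continue, and cr_self_user_restart.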