Mpiexec is a replacement program for the script mpirun, which is part
of the mpich package.  It is used to initialize a parallel job from
within a PBS batch or interactive environment.

See the man page for detailed information.

Copyright (C) Pete Wyckoff, 2000-7.  <pw@osc.edu>

Installation instructions
-------------------------

1.  First, figure out what version of PBS you are using.  Torque is highly
    recommended, but you might also use its predecessor OpenPBS or the
    non-free PBSPro.

    Known good PBS versions include:

	Torque 1.2.0 through 2.1.6 and beyond
	    (http://www.supercluster.org/projects/torque/)
	OpenPBS-2.3.11 through .16  (http://www.openpbs.org)
	SPBS 1.0.0 rc1 through rc4  (discontinued)

    If you're using Torque, no patches are required.  OpenPBS will need
    one patch to enable all the functionality.  PBSPro cannot be patched,
    but mpiexec includes some hacks to make it work properly.  Recent
    PBSPro (version 8?) does not work with mpiexec.  There is a
    temporary PBSPro-only patch from Matt Ford here:

	http://email.osc.edu/pipermail/mpiexec/2007/000842.html

1a. OpenPBS patch.

    This patch adds the functionality which allows the stdio streams from a
    parallel process to be sent directly to mpiexec.  It also provides the
    capability to send stdin to more than just process number zero, if you
    so choose.  It is not mandatory to apply this patch, in which case
    these stdio redirection features will not work, but the basic MPI
    spawning through the TM interface of PBS will still function just fine.

    See the Historical Notes below for information on use with PBSPro or
    older versions (< 2.3.12) of OpenPBS.

    Apply the patch doing something like this:

	cd /usr/local/src/pbs-2.3.12
    	patch -p1 -sNE < /home/pw/src/mpiexec/patch/pbs-2.3.12-mpiexec.diff

    Attempts have been made so that the behavior of PBS does not change
    unless explicitly instructed to do so by mpiexec.  You'll need to build
    and install PBS as usual, then restart all the MOMs on the compute
    nodes.

1b. (EXPERIMENTAL)  A second patch to PBS is necessary if you would like
    mpiexec jobs to survive across a restart of the pbs_mom using the "-p"
    flag to reattach existing jobs.  If you do not plan to kill and restart
    pbs_mom on a node while it has jobs running, do not bother with this
    patch, although it should do no harm.

    It does four things:

	- Fix coredump resulting from tm_spawn to restarted pbs_mom
	- Avoid race condition by which pbs_mom would sometimes kill itself
	  as tasks exit.
	- Make a restarted pbs_mom search for and report exiting tasks
	  from jobs which were started before the old mom was killed.
	- Change response of pbs_mom to various signals.  Now the default
	  is to leave all jobs running.  If you want to stop all jobs,
	  USR1 can be used to achieve the old behavior.

    Without this patch, mpiexec will exit with "tm: system error" when the
    new pbs_mom is started with the "-p" argument.

    If you want to experiment with this capability, apply the second
    patch similarly.  Be warned that this adds a function to the machine-
    specific code for linux, but no other architectures, thus this
    entire experiment requires linux:

    	patch -p1 -sNE < /home/pw/src/mpiexec/patch/pbs-2.3.12-mom-restart.diff

    Note that on my linux redhat 7.3 systems, PBS 2.3.12 will not actually
    compile out of the box without another patch unrelated to mpiexec.
    Grab and apply
	http://www.osc.edu/~pw/pbs/no-linux-headers.patch
    if this is the case for you, or read about all the patches we use at
	http://www.osc.edu/~pw/pbs/

1c. Old MPICH/P4 only.  If you are using an mpich older than 1.2.4, see the
    mpich section below for a necessary patch.

1d. Old MPICH/GM only.  WARNING!  If you are using an MPICH-GM distribution
    from Myricom that is older than 1.2.4..8, this version of mpiexec will
    not work.  Fall back to mpiexec-0.69 or upgrade your mpich-gm.

1e. Old Torque only.  You need the patch distributed here in patch/
    torque-1.1.0p0-mpiexec.diff.

2.  Run ./configure with the usual configure syntax.  There is one
    mandatory configure option, plus some other ones described below.


    Choosing a default communication device.

    You must choose a default communication device, that is, what variant
    of MPI library and network interfaces are used by your machines.  Try
    to pick the one that your users will use most often, e.g.:

	--with-default-comm=mpich-p4

    otherwise your users will always have to specify

	--comm mpich-p4

    at every invocation of mpiexec (or use an environment variable) to
    override your default.  The current list of devices is:

	--with-default-comm=(mpich-gm|mpich-mx|mpich-p4|mpich-ib|mpich-rai|
			     mpich2-pmi|lam|shmem|emp|portals|none)

    If the user does not use the "-comm" argument to mpiexec, and does not
    set the MPIEXEC_COMM environment variable, this named communication
    device will be used.
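
    As a minimal sketch of the two per-user overrides just described,
    assuming mpich-gm is the device wanted in place of the compiled-in
    default (the helper name run_with_gm is hypothetical):

```shell
# Option 1: select the device for the whole shell session through the
# environment variable that mpiexec consults.
export MPIEXEC_COMM=mpich-gm

# Option 2: name the device on the command line of a single invocation.
# run_with_gm is an illustrative wrapper, not part of mpiexec.
run_with_gm() {
    mpiexec --comm mpich-gm "$@"
}
```

    Inside a PBS job one might then run "run_with_gm ./a.out", where
    ./a.out stands for your own MPI executable.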


    PBS options.

    --with-pbs=PATH

	Specify the location of the PBS library.  Default is
	/usr/local/pbs, where the Makefile will expect to find
	files lib/libpbs.a and lib/liblog.a containing the TM interface
	functions, and header file include/tm.h.
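
	Before running configure, it can help to confirm that the chosen
	directory really holds the files named above.  The following is
	only a sketch; the function name check_pbs_prefix is hypothetical
	and the file list is taken directly from the description above.

```shell
# check_pbs_prefix [DIR]: report whether DIR holds the files the mpiexec
# Makefile expects.  Returns 0 if all are present, 1 otherwise.
check_pbs_prefix() {
    prefix=${1:-/usr/local/pbs}
    missing=0
    for f in lib/libpbs.a lib/liblog.a include/tm.h; do
        if [ ! -f "$prefix/$f" ]; then
            echo "missing: $prefix/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

	Typical use would be "check_pbs_prefix /opt/pbs" followed by
	"./configure --with-pbs=/opt/pbs ..." if nothing is reported
	missing; /opt/pbs is just an example location.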

    --enable-pbspro-helper

	Choose this option if you use PBSPro.  That batch system does
	not have the mpiexec patch, and unless you have the source and
	have patched it yourself, you will not get standard IO streams
	redirection.  This builds a separate executable that handles the
	redirection for the processes, then starts the parallel code.
	Do not use this option for OpenPBS or Torque.  See the man page
	for mpiexec-redir-helper for more information.


    Rarely used configure options.

    --with-mpicc=PATH
    --with-mpif77=PATH

	Name of mpicc code or script used to compile an mpi program.  This
	is only used for the test program and will not affect your mpiexec
	at all.  Default is "mpicc" which will look in your path for a
	suitable script.  Another possible choice would be, for example:
	"--with-mpicc=/home/frog/my-mpich/bin/mpicc".  Similar option for
	finding a fortran compiler, again completely optional.

    --with-sed=PATH

	Name of external program to use to implement --transform-hostname.
	This defaults to "sed" whose location is then looked up in the current
	path when configure is run.  The exact location must be available
	on the compute nodes when mpiexec runs.  You may supply a different
	path or program name here too.  If the argument is absolute, with a
	leading '/', it is accepted as given, otherwise it is searched for in
	the current path.

	Crazy example for perl devotees:  configure --with-sed=perl.
	Then at runtime one might do:
	mpiexec --transform-hostname='while (<>) { s/amd/mamd/; print }'.

    (EXPERIMENTAL)
    --with-fast-dist=PATH

	Normally mpiexec expects all the compute nodes to share a file
	system where the executable program lives, such as NFS from a
	single server.  If this is not the case, it is up to you to move
	the program out to the same location on all the nodes in advance.

	This option lets you use an external program to move the executable
	to the compute nodes with a fast, tree-based algorithm that
	operates natively on InfiniBand.  It is extremely quick compared to
	NFS.  To enable mpiexec to stage executables, install the code from
	http://www.osc.edu/~dennis/fastdist/ and compile mpiexec to tell it
	where to find the program "fast_dist".  If you do not give an
	absolute path for PATH, configure will search for it in your current
	PATH.


    Now for the individual communication libraries, and their options.  It
    is quite likely that you will not need to be concerned with any of this
    section.  Every version of MPI that mpiexec knows about will be
    supported in the resulting code, but you can choose to disable the ones
    you don't want.  Note that no MPI libraries are required, so there is
    no need to disable an option just because you don't currently use it on
    your system.  They're harmless.



    MPICH/GM and MPICH/MX

    --disable-mpich-gm

	Disable the use of Myrinet devices using MPICH over GM or MX.  Default
	is to support MPICH/GM and MPICH/MX.  Note that MX is the newer
	message passing interface from Myricom, but it is handled in
	mpiexec with the same code that does MPICH/GM.

    MPICH/p4

    --disable-mpich-p4

	Disable the use of sockets devices using MPICH with the p4 library.
	This is what people generally use with ethernet hardware.  Default
	is to include support for MPICH/p4.

    --disable-p4-shmem

	For SMP machines, specify that MPICH/P4 was compiled without shared
	memory support.  You must select whether you plan to use shared
	memory with MPICH/P4 when you compile the mpich library.  To use
	shared memory, add the configure option "--with-comm=shared" when
	you build mpich.  It is highly recommended that you enable
	shared-memory communication in this way.

	Then when you configure mpiexec, if you have added that option to
	the mpich build, it is not necessary to do anything.  However, if
	you choose NOT to build mpich/p4 to use shared memory, you should add
	the flag "--disable-p4-shmem" here.  Note that you must make sure
	that mpich and mpiexec are compatible in this regard or applications
	will not start.

	The mpiexec command-line flags "-mpich-p4-no-shmem" and
	"-mpich-p4-shmem" can be used to specify MPICH/P4 configuration
	information explicitly at runtime, overriding this compile option.

	To summarize, configure lines should match as follows:
	    mpich/configure --with-device=ch_p4 --with-comm=shared ...
	    mpiexec/configure --with-default-comm=mpich-p4 ...
	Or
	    mpich/configure --with-device=ch_p4 ...
	    mpiexec/configure --with-default-comm=mpich-p4
	      --disable-p4-shmem ...

    MPICH/IB

    --disable-mpich-ib

	Disable the ability to start parallel processes compiled against
	an InfiniBand version of MPICH.  More information about this device
	can be found at http://nowlab.cis.ohio-state.edu/projects/mpi-iba/.

	This version of mpiexec supports OSU MVAPICH releases 0.9.2 and
	0.9.4 (and likely others) by autodetecting during process startup
	based on a version number in the protocol.

    MPICH/RAI

    --disable-mpich-rai

	Disable the code to start parallel processes compiled against the
	Rapid Array Interconnect version of MPICH used by Cray on their XD1
	machines.  These are Opteron clusters with custom message passing
	code on an Infiniband physical-layer transport.  The MPICH device
	comes from the MVIA heritage and thus looks a lot like the
	old-style MPICH/IB startup code.

    MPICH2/PMI

    Do not compile MPICH2 to use the SMPD process manager.  It appears
    to offer no advantages over the default MPD, and does not work with
    mpiexec.  That could be fixed, if there were a compelling reason to
    do so.  (The other offered process manager is gforker, which is also
    not very interesting, as it only works on a single node.)

    --disable-mpich2-pmi

	Disable the ability to start parallel processes compiled against
	the MPICH2 library PMI process management interface.  This
	mechanism is designed to support all underlying communication
	hardware supported by the new MPICH2 library.  More information
	is available at http://www-unix.mcs.anl.gov/mpi/mpich2/.

	This code is known to work with the ch3 device in MPICH2, but
	may work with other devices as they become available.  When compiling
	ch3, you have a choice of channels.  These are known to work
	as of mpich2-1.0.1 and mpiexec-0.78:

	    --with-device=ch3:sock
	    --with-device=ch3:shm
	    --with-device=ch3:ssm

	Unlike with MPICH1, it is not necessary to explain to mpiexec which
	variant you plan to use.

	Note that as of mpich2-1.0.3, MPI_Abort called in one task does not
	try to terminate the entire parallel process.  It would be nice if
	the aborting process told the process manager that an abort is in
	progress.  This does happen in mpich1/gm, mpich1/ib, and partially
	in mpich1/p4.  Instead, in mpich2, the processes not calling
	MPI_Abort will exit only if they happen to try to communicate with
	the aborting process.  Watch PMI_Abort() in
	mpich2/src/pmi/simple/simple_pmi.c to see if they ever add this
	functionality, at which time we can add support to mpiexec.

    LAM

    --disable-lam

	Disable the use of the LAM device.  There really isn't any code in
	here specific to LAM, as mpiexec is used only to start up the lamd
	on each node, and it spawns the actual user applications.  The LAM
	device acts exactly like the "none" device.  There are more notes
	on LAM at the bottom of this file, and in README.lam.

    SHMEM

    --disable-shmem

	Disable the use of the SHMEM device.  The SHMEM device is only used
	on single-node configurations, like for large SMPs.  There is no
	support for ethernet or any other out-of-box communication.  The
	options above about shmem under the P4 and GM sections are not
	related to this SHMEM device, but rather sub-drivers in the P4 and
	GM drivers, respectively.  If you have just one big Sun or HP SMP
	machine, for example, or some other single node multi-processor box
	you will want to use the SHMEM device.

    EMP

    --disable-emp

	Disable the use of the EMP device.  The procedure to start up an EMP
	job is much like that of GM, without the need for a globally
	readable configuration file.  More information about EMP is
	available at http://www.osc.edu/~pw/emp/.

    PORTALS

    --disable-portals

	A rather hacky, partial implementation of Portals support.  It
	assumes the use of the user-space TCP implementation of Portals,
	and that you will be using eth0 for communication.  It does set
	up the nidmap and pidmap environment variables, though, which is
	a pain to do by hand.  Big machines that use Portals have their
	own job launcher, called yod.

    NONE

    --disable-none

	This communication layer does not set anything in the environment,
	or build any configuration files.  Handy if you want to run
	something on each processor of your job allocation without wanting
	mpiexec to bother to build an environment for it.

3.  Build it:

	make

    Note that GNU make is required.  It may be called "gmake" on your
    system.

4.  Run the tests. (OPTIONAL)  You'll need a working MPICH of some flavor
    to build the hello test program.  The default compiler used for this
    task is "mpicc" unless you have configured with the "--with-mpicc"
    switch.

	make hello

    After compiling, be sure to take a look at the script "runtests.pl",
    especially the comments towards the top where there are some configurable
    items.  Then run it:

	./runtests.pl

    It invokes the batch system once for each of about 50 tests.  Each
    of these creates many little files:

	testqs.* - PBS job scripts submitted with qsub
	testqo.* - PBS joined stdout/stderr
	testho.* - mpiexec joined stdout/stderr
	testc.*  - config file passed to mpiexec with "-config" flag

    Successful tests will show the one-line qsub output and print dots
    until the test is complete.  Unsuccessful runs might say
    "Got 7 lines in ..., expected 8" or "Unexpected line: ...", in which
    case you may want to investigate the relevant output files.  Expect the
    "-segv" tests to generate some unexpected lines which vary depending
    on the communication library.  Also, the shell tests can cause some
    problems depending on what you have in /etc/shells and /etc/profile.d,
    etc.  When done,

    	rm test*

    to clean up.  There's no need to look at the successful output files
    unless you're curious what happened.
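
    If the build directory holds other files whose names begin with
    "test", a narrower cleanup may be safer than "rm test*".  This is a
    sketch only; clean_test_files is a hypothetical helper that removes
    just the four file families listed above.

```shell
# clean_test_files [DIR]: delete only the files that runtests.pl is
# described as creating, leaving any other test* names untouched.
clean_test_files() {
    dir=${1:-.}
    for prefix in testqs testqo testho testc; do
        # The glob may match nothing; rm -f stays quiet in that case.
        rm -f "$dir"/"$prefix".*
    done
}
```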

5.  Install:

	make install

    This puts the executable in /usr/local/bin (or <prefix>/bin if you have
    told configure otherwise using, e.g. --prefix), and a man page in
    /usr/local/man/man1.  You may need to be root to do this.

6.  Cleanup:

	make clean

    Or "make distclean" if you want to zap the config.* output files too.


Concurrent tests
----------------
The concurrent mpiexec feature is described in the man page.  It allows
running multiple independent parallel programs in the same batch job.  Each
parallel program has its own invocation of mpiexec, with all the subsequent
ones relying on the first for communication with PBS (as required per
limitations in the TM library).

It creates a directory /tmp/mpiexec-sock with permissions 01777 and a
separate subdirectory in the format <username> with permissions 00700
under that, one for each user.  Named pipes of the form <jobid>.<hostname>
are used for communication in the context of a single PBS job.  It tries to
clean up after itself but will handle gracefully the case where any of the
directories or files still exist.
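
To make the layout concrete, the sketch below rebuilds the same directory
structure under a scratch location.  It only illustrates the modes named
above and is not how mpiexec itself creates the directories; the function
name make_sock_layout and its arguments are hypothetical.

```shell
# make_sock_layout BASE USER: create BASE/mpiexec-sock with mode 01777
# and BASE/mpiexec-sock/USER with mode 00700, mirroring the description.
make_sock_layout() {
    base=$1
    user=$2
    mkdir -p "$base/mpiexec-sock/$user" &&
    chmod 1777 "$base/mpiexec-sock" &&
    chmod 700 "$base/mpiexec-sock/$user"
}
```

On a node where concurrent jobs have run, "ls -ld /tmp/mpiexec-sock
/tmp/mpiexec-sock/*" should show the same two sets of permissions.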

You can test this concurrent code by starting an interactive job with a
bunch of processors, and inside the shell of the job, run "./contests.pl".
It needs the "hello" program to exist just like runtests.pl.  You'll see
a bunch of dots indicating each invocation successfully running, for some
large number of these in parallel.  If there is any output text, try to
figure out the problem and send mail to the list if there seems to be a
bug.


Problems?
---------
Here are some notes collected from solving various installation and
usage problems with mpiexec, organized into a FAQ format.

1.  Does mpiexec work with OpenPBS 2.4?

    There is no OpenPBS 2.4.  Veridian changed the code in 2.3.16
    so that it claims to be "OpenPBS_2.4".  Type "l s" at a qmgr
    prompt to see this.  The code is still 2.3.16 in spirit since
    it is hardly different from 2.3.15 or the last couple years
    of earlier versions for that matter.

2.  The configure script can't find my PBS library, but I gave it the
    correct path.

    You probably need to compile mpiexec using whatever compiler you used
    to build PBS, otherwise some symbols may not be defined.  This will
    show up as configure complaining "PBS library not found ...".  Check
    config.log to verify if it really was not found, or if you chose a
    different compiler.

    Override the compiler choice at configure time by setting the
    environment variables CC and CFLAGS.

    You can run "bash -x ./configure ..." to see everything it does
    to try to figure out what's wrong.

3.  Mpiexec exits immediately with the message "mpiexec: Error:
    get_hosts: tm_init: tm: system error".

    This is the very first line in the code where mpiexec attempts to
    talk to the local PBS mom.  Lots of things can go wrong so that
    PBS will not let that happen.  One problem could be that name
    resolution is not working correctly.  You need to have entries in
    /etc/hosts (or a working DNS resolver) for both localhost and for
    your PBS server, like this:

	127.0.0.1  localhost
	10.0.0.254 front-end fe  # pbs server

    Other variations might work too.  On the server, you probably
    need hosts entries for all the other nodes, too, but I suspect
    you'd notice something else broken before mpiexec.  Don't forget
    to restart pbs_mom or pbs_server as appropriate after changing
    a system configuration file like /etc/hosts.
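
    A quick way to confirm that a hosts file carries the needed names
    is sketched below.  The helper hosts_has_name is hypothetical, and
    it inspects only a hosts-format file, not DNS.

```shell
# hosts_has_name FILE NAME: succeed if NAME appears as a hostname or
# alias on any line of a hosts-format FILE, ignoring comments.
hosts_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # drop comments before looking at fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

    For example, "hosts_has_name /etc/hosts front-end || echo missing"
    would flag a node lacking the server entry shown above.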

4.  Are there any debugging tools to figure out why the entire mess
    does not work?  Especially this confusing "system error" message?

    There are lots of bits that must cooperate to run a parallel job:  PBS
    server, PBS mother superior, other PBS moms, mpiexec, mpich library,
    and your application code.  It's tough to figure out where the fault
    lies when something fails.

    PBS problems are frequently logged.  See on the mother superior node
    (the compute node which holds process #0 of your parallel job) the
    file
    	/var/spool/pbs/mom_logs/20021030
    or whatever the date is today.  On the PBS server machine, you'll
    find log messages in
    	/var/spool/pbs/server_logs/20021030
    If you install into a different location you'll have to change
    the path prefix, of course.
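
    Since the log file name is simply the date as YYYYMMDD, a small
    helper can build today's path.  This is a sketch; pbs_log_path is a
    hypothetical name, and the default spool directory is the one used
    in the examples above.

```shell
# pbs_log_path KIND [SPOOL]: print today's log file for KIND, which is
# "mom" or "server".  SPOOL defaults to /var/spool/pbs.
pbs_log_path() {
    kind=$1
    spool=${2:-/var/spool/pbs}
    echo "$spool/${kind}_logs/$(date +%Y%m%d)"
}
```

    On the mother superior node, "tail -n 50 $(pbs_log_path mom)" then
    shows the most recent mom messages.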

    The "big hammer" of debugging tools here is strace.  If mpiexec
    complains when talking to the PBS mom, grab the mpiexec with an strace
    and watch what it's doing right before it prints out the error
    message:
	strace -vfF -s 400 -o /tmp/strace.mpiexec.out mpiexec myjob
    Look through the output file for the error message, then back up
    a few lines and try to guess what went wrong.  If it looks harmless,
    maybe the PBS mom is causing the problem.  As root, find the pid
    of the pbs_mom on the node, then attach to it with strace in a
    different terminal session:
	strace -vfF -s 400 -o /tmp/strace.mom.out -p <pid>
    then start your job and watch what happens.

5.  When I do "mpiexec <script>", it doesn't work.

    Mpiexec is a parallel program spawner: it expects to be given an
    executable compiled with an MPI library.  Some MPI library versions
    initialize themselves using command-line arguments to the process.
    If you try to mpiexec a shell or perl script, for instance, these
    arguments are delivered to the shell, and it is your duty to pass
    them on to the actual MPI code when you invoke it.  Do something
    like the following if you must:

	#!/bin/bash
	echo hi from one of the parallel processes
	a.out "$@"
	echo this one is all done

6.  My program sees extra weird command line arguments.

    In the MPICH/p4 library, the only way to start processes is to
    provide them with command-line arguments specifying information about
    their environment: hostname and port number of the "master", own node
    ID, total number of nodes, etc.  These appear in main() in the argv[]
    array and are passed into MPI_Init() which interprets them to construct
    the parallel environment.  It then removes from argv[] the arguments
    it understands and leaves the rest for the main program.

    If your code tries to parse the arguments in argv[] _before_ calling
    MPI_Init(&argc, &argv), you will unfortunately see, and not understand,
    these extra arguments.  The best solution is to put the call to
    MPI_Init before any argument processing.

7.  When my job is killed by PBS due to hitting a walltime (or any other)
    limit, the error output file has a strange line "mpiexec: warning:
    main: task x died with signal 15".

    This is proper behavior by mpiexec, and is one of the good features
    that makes it better than the rsh-based mpirun programs.

    Using mpirun, the PBS mom will kill all processes that it can find on
    the mother superior node (first node assigned to the job).  Eventually
    the MPI processes on other nodes will die off because they notice that
    one of their brethren has gone away when it is time to send that
    deceased peer a message.  PBS does not know about these processes on
    other nodes since they were started via rsh, and cannot know to kill
    them off.

    With mpiexec, PBS itself starts all the processes in the parallel job,
    thus when it notices that you have gone beyond your walltime, it can
    kill off each process individually, with no mess and no fuss.  This
    ensures that you don't get runaway processes due to code bugs, for one
    thing, and also accounts for CPU and other resources used by the entire
    job, not just process number zero.

8.  My code generates a long error message:

      process not in process table; my_unix_id = 29969 my_host=n124
      Probable cause:  local slave on uniprocessor without shared memory
      Probable fix:  ensure only one process on n124
      (on master process this means 'local 0' in the procgroup file)
      You can also remake p4 with SYSV_IPC set in the OPTIONS file
      Alternate cause:  Using localhost as a machine name in the progroup
      file.  The names used should match the external network names.

    Make sure you have configured and compiled mpich/p4 with
    "--with-comm=shared".  If you are sure you do _not_ want mpich to be able to
    do shared-memory communication within SMP nodes, then you must let
    mpiexec know about this.  The easiest way is to configure mpiexec with
    "--disable-p4-shmem" (described above) and recompile, or you can use
    the runtime flag "-mpich-p4-no-shmem" as a quick test to verify this is
    indeed the problem.  There is no way to auto-detect if mpich was
    configured with or without the shared option.

9.  The compute node processes do not start up properly, they say something
    like:

	[1] Error: Unable to connect to the master !

    This is an error message from MPICH-GM, and others may give a similar
    error when the compute processes are not able to contact back to the
    master.

    The hostnames of your compute nodes must be listed in /etc/hosts (or
    DNS if you have one) and assigned to the IP address of the machine as
    viewed by other nodes in the cluster.  A common mistake is to assign
    the hostname to the loopback address:

	127.0.0.1   node01 localhost
	192.168.0.1 node01

    Never do this.  A proper /etc/hosts file should look something like:

	127.0.0.1   localhost
	192.168.0.1 node01

    The problem happens when a compute process on node01 tries to resolve
    "node01" to figure out on what address to listen for incoming
    connections, and ends up listening on the loopback where no external
    machine can connect.  Mpiexec has the same problem when it binds
    on a local port---if it ends up binding to 127.0.0.1 due to this
    /etc/hosts problem it will never receive connections from processes on
    different machines.
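
    The mistake shown above can be detected mechanically.  The sketch
    below, with the hypothetical helper name loopback_names, prints any
    name other than a localhost-style one that a hosts-format file
    assigns to a 127.x.x.x address.  It does not look at IPv6 entries
    or at DNS.

```shell
# loopback_names FILE: print each non-localhost name that FILE maps to
# an address in 127.0.0.0/8.  No output means the file looks sane.
loopback_names() {
    awk '
        { sub(/#.*/, "") }    # drop comments before looking at fields
        $1 ~ /^127\./ {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^localhost/) print $i
        }
    ' "$1"
}
```

    Running "loopback_names /etc/hosts" on each compute node should
    print nothing.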

10. I get a bunch of messages "connect: Connection refused" and the code
    exits.

    If you're using the Mellanox Infiniband IBGD distribution, and you are
    using the mpich that they include, and you have OpenPBS or Torque on
    your machine, it won't work.  Mellanox included a patch to fix I/O
    redirection problems in PBSPro to satisfy one particular customer.
    That fix happens to break what would otherwise be working setups that
    use OpenPBS or Torque.

    As a quick hack, you can find the shared library libmpich.so, edit it
    and change the three strings that look like "MPIEXEC_STDOUT_PORT" (and
    STDERR and STDIN) and change them to, e.g. "ZPIEXEC_STDOUT_PORT" or
    anything else that is the same length and unlikely to be defined in
    your environment.
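
    One possible way to make that edit without a binary editor is
    sketched below.  It assumes GNU sed, which passes binary data
    through unchanged apart from the substitution; the function name
    rename_port_strings is hypothetical.  It writes a patched copy
    rather than changing the library in place, and discards the copy if
    the file size changed.

```shell
# rename_port_strings IN OUT: write a copy of IN to OUT in which each
# MPIEXEC_STD*_PORT string begins with Z instead of M (same length).
rename_port_strings() {
    in=$1
    out=$2
    LC_ALL=C sed 's/MPIEXEC_STD\([A-Z]*\)_PORT/ZPIEXEC_STD\1_PORT/g' \
        "$in" > "$out" || return 1
    # A shared library must keep its exact size; refuse anything else.
    in_size=$(( $(wc -c < "$in") ))
    out_size=$(( $(wc -c < "$out") ))
    if [ "$in_size" -ne "$out_size" ]; then
        echo "size changed, discarding $out" >&2
        rm -f "$out"
        return 1
    fi
}
```

    Keep the original libmpich.so until the patched copy has been
    tested with a real job.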

    Note, this is also a problem with OSU's mvapich releases 0.9.6 and
    0.9.7, as they included the bogus patch from Mellanox.  Starting
    with 0.9.8rc0, mvapich works again.

11. Jobs fail when approaching large processor counts (say 512).

    The error message might be "need XXX sockets, only YYY available" if
    detected early, or might appear later as "Too many open files".
    Common mpiexec usage requires two open sockets per task, or none
    for "-nostdout" usage.  The default open file limit is often low,
    around 1024.  In bash, "ulimit -n" will show the number of open files
    allowed in the session.  You can increase that on Red Hat-based systems
    by adding a line to /etc/security/limits.conf:
	* - nofile 65536

    Another way to increase the limits is to put a line in your
    /etc/init.d/pbs_mom (or equivalent) startup script that explicitly
    sets the limit for the mom and all its job descendants:
	ulimit -n 65536
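
    The arithmetic can be checked ahead of time.  The sketch below uses
    the two-sockets-per-task figure given above and adds three
    descriptors for stdin, stdout and stderr, which is an assumption of
    this example rather than a number taken from mpiexec.  The function
    names are hypothetical.

```shell
# sockets_needed NTASKS: two sockets per task, per the estimate above.
sockets_needed() {
    echo $(( 2 * $1 ))
}

# fd_limit_ok NTASKS [LIMIT]: succeed if LIMIT (default: the current
# "ulimit -n") covers the sockets plus three standard descriptors.
fd_limit_ok() {
    limit=${2:-$(ulimit -n)}
    [ "$limit" = "unlimited" ] && return 0
    [ $(( $(sockets_needed "$1") + 3 )) -le "$limit" ]
}
```

    With a limit of 1024, "fd_limit_ok 512 1024" fails, matching the
    512-processor symptom described above, while "fd_limit_ok 512 65536"
    succeeds.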

12. Only one process is launched, and mpiexec says "task 0 exited before
    completing MPI startup".

    This happens when you are using MPICH2, but have told mpiexec that
    it should use the MPICH1/P4 communication method.  Try with
    "--comm=mpich2-pmi", and if that works, rebuild mpiexec using
    "--with-default-comm=mpich2-pmi" for convenience.

13. Mpiexec exits immediately with the error "mpiexec: Error: get_hosts:
    pbs_connect: Unauthorized Request".

    You need to include the pbs_iff executable on your compute nodes,
    and it must be setuid root.  If you're using the Fedora Torque RPMs,
    this implies that you should install the torque-client RPM as well
    as libtorque and torque-mom.

    If the binary is present, check that the permissions are correct
    (-rwsr-xr-x or similar), and that it is owned by root.  If the binary
    lives remotely on an NFS-mounted file system, be sure that you have
    not mounted with the "nosuid" option.
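
    The permission and ownership checks can be scripted as sketched
    below.  The function name check_setuid_root is hypothetical, and
    the location of pbs_iff depends on your installation, so the path
    is passed as an argument.  The sketch does not detect a "nosuid"
    mount.

```shell
# check_setuid_root FILE: succeed only if FILE exists, has the setuid
# bit set, and is owned by uid 0.  Prints the reason on failure.
check_setuid_root() {
    f=$1
    if [ ! -f "$f" ]; then
        echo "$f: not found" >&2
        return 1
    fi
    if [ ! -u "$f" ]; then
        echo "$f: setuid bit not set" >&2
        return 1
    fi
    # ls -n prints the numeric owner id in the third column.
    owner=$(ls -ln "$f" | awk '{ print $3 }')
    if [ "$owner" != "0" ]; then
        echo "$f: owned by uid $owner, not root" >&2
        return 1
    fi
}
```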


Historical notes
----------------
The items in this section were necessary for older versions of related
software.  For current releases, none of this is necessary.

1.  Upgrading from a previous version of mpiexec.

    Starting with mpiexec-0.64, a new patch, pbs-2.3.12-mpiexec.diff, is
    included which changes the operation of PBS _not_ to pass a job cookie
    at the start of every output stream.  This gets in the way of running
    multiple parallel jobs within a single batch job.  It seems PBS
    included the idea of a job cookie to add some security, but it is a
    rather weak form of security to protect a minor hole with little
    potential exploit gain (no root interactions).

    If you had an older version of mpiexec, which included its patch to
    PBS, you should reverse this one out first:

	cd /usr/local/src/pbs-2.3.12
    	patch -p1 -sRE < /home/pw/src/mpiexec/patch/pbs-2.3.11-mpiexec.diff

    Then continue as above to apply the _new_ PBS patch and recompile.  If
    you don't want to do this, it's not a big deal, but you will see a
    32-character string in your output before each node does its first
    communication to stdout/stderr.

2.  MPICH/P4 versions older than 1.2.4 only.

    Apply the patch to your mpich tree and rebuild it.
    Watch out!  It's important to look at the timestamp in
    $mpich/include/patchlevel.h to know which patch you need.  Depending
    on when you downloaded the tarball from Argonne, you'll have a possibly
    different version.  Here is the example for an MPICH distribution
    grabbed sometime between 19 Nov 01 and 18 Jan 02.  Note that the
    patch for the later MPICH distribution got considerably smaller.  Thanks to
    the MPICH developers for being willing to integrate my fixes.

        cd /usr/local/src/mpich-1.2.3-alpha-011119
	patch -p1 -sNE < ~/src/mpiexec/patch/mpich-1.2.3-alpha-011119-mpiexec.diff
	make
	make install

    This is a necessary step to support older MPICH/P4 with mpiexec.  If
    you do not plan to use MPICH with ethernet cards for message passing,
    ignore this patch.

3.  OpenPBS versions older than 2.3.12.

    Older versions of PBS need more extensive bug fixes as well as the above
    functionality additions.  Mpiexec will not work at all without the
    patch.  A patch against pbs-2.3.11 is provided in this distribution for
    those that want to use mpiexec with older PBS.

    You will also have to configure mpiexec with another option to satisfy
    an older PBS distribution:

    --with-pbssrc=PATH

	This option is _only_ necessary with versions of pbs older
	than 2.3.5.  Before that release, the necessary header file which
	exports the TM functions was not installed, hence we need to
	go wading through the source tree to find it ourselves.  You
	probably don't need this option, and if you do, you will be told
	noisily during configure.  Default is unset.


MPICH/p4 notes
--------------
There were quite a number of issues to work out in supporting MPICH/P4.

1.  --with-comm=shared

    I'll point out that you must compile mpich with "--with-comm=shared" to
    have SMP work, which is what most people with SMP machines want.  Maybe
    it is in the documentation, but I did not realize for some time.  There
    is a configure option to mpiexec if you really do not want to compile
    mpich with the shared option, see --disable-p4-shmem above.

2.  Spawn interface

    With MPICH/GM, you just start up the processes with certain environment
    variables on the machines you've been allocated, point them all to the
    same configuration file, and they find each other and run in parallel.

    With MPICH/P4, the code is designed to be started from a single machine,
    which then uses rsh to start the rest of the processes.  This won't work
    with mpiexec, and is a bad idea under PBS because you lose all the
    accounting information, one of the reasons mpiexec even exists.

    So the question is: does MPICH offer some interface that can defeat
    the self-spawning mechanism via rsh?

2a.  P4

    The P4 option "-p4norem" isn't acceptable because it would require
    mpiexec to read the stdout of process number zero to determine which
    jobs to start and in what order.  Definitely non-scalable, and the
    requirement that the user application not call printf() before
    MPI_Init() seemed too onerous.  True, I could have disabled the
    printf() and gone grepping through /proc for the socket, but it still
    requires one-at-a-time startup.

2b.  Execer

    The execer interface was what I ended up using, but it required large
    modifications.  The examples that come with mpich which use the execer
    interface were not working due to argument mismatch, so I felt it okay
    to revamp the way the library interpreted the arguments.  The idea is
    that you start the "big master" process with a list of all the other
    nodes in the job on the command line.  Each of the "remote master"
    processes gets a small amount of information on the command line, and a
    pointer to the host/port where the big master is listening.  Thus we
    must start process number zero, wait for it to initialize, then start
    the rest.

    Bits I removed from the execer interface:  originally all the remote
    masters would do a loop of "rsh <bigmaster> cat /tmp/p4_5656" to get
    the port on which the big master was listening.  Blecch.  Instead I
    introduced that ordering constraint mentioned above.  Rather than
    having the big master write to a file in /tmp, I tell it to connect to
    mpiexec to give it the port number, which is then passed on the command
    line to all the other processes.

    Note that MPICH/P4 has hardcoded into it that the "big master" connects
    to an execer on localhost to write this port number.  The implication
    is that mpiexec must run on the same machine as process #0, thus the
    -nolocal option will not work.
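
    The ordering constraint can be illustrated with a toy simulation.
    Nothing below starts a real MPICH process or reflects the actual
    command-line format; the host names, the port number, and both
    function names are invented for illustration.

```shell
# Toy simulation only: host names and the port number are made up.
BIG_MASTER_HOST=node00
REMOTE_HOSTS="node01 node02 node03"

# Stand-in for process zero: once initialized, it reports the port it
# listens on (the real big master connects to mpiexec to do this).
start_big_master() {
    echo 40123
}

# Stand-in for launching one remote master with the big master's
# host and port on its command line.
start_remote_master() {
    echo "start on $1: connect to $2:$3"
}

# Step 1: start process zero and wait for its port.
PORT=$(start_big_master)

# Step 2: only now start the rest, passing the port along.
LAUNCH_LOG=$(for host in $REMOTE_HOSTS; do
    start_remote_master "$host" "$BIG_MASTER_HOST" "$PORT"
done)
echo "$LAUNCH_LOG"
```

    The point is that step 2 cannot begin until step 1 has produced the
    port, which is why process zero is started alone.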

2c.  MPD

    None of this relates to MPD at all.  MPD is another daemon which runs
    on the compute nodes and can spawn MPICH processes.  It does not
    interact with PBS with regard to accounting, allocations, etc.

2d.  Thread (vs process) listener

    Do not define THREAD_LISTENER; only the process listener code has been
    modified to work properly with execer_starting_remotes.


MPICH/gm notes
--------------
In the old days, before mpich-1.2.4..8, starting up MPICH-GM parallel
processes was relatively simple.  Provide a global file on a shared file
system which explained the configuration of the machine, and all processes
would read through that.  That worked well assuming you had such a file
system.

Starting with mpich-1.2.4..8, Myricom chose to get rid of that global
file, and instead requires several rounds of communication between the
startup process, such as mpiexec or mpirun.ch_gm, and each of the slave
processes.  Mpiexec opens two listening sockets and starts each process
independently.  They each connect back to mpiexec and provide information
including the Myrinet board and port number to be used by that process.
That first listening socket is then closed.

Then we repeat: each process again calls back to the second listening
socket opened by mpiexec, sends a magic number, then expects mpiexec to
send out the global and local mapping information.  This second socket is
then closed, but mpiexec continues to listen as a new connection might be
initiated by a process to request an MPI_Abort and teardown of the entire
job.  Why two sockets?  Dunno, ask Myricom.  Inside mpiexec we use a second
process to handle the stdio because the initial process is trapped blocking
in a TM call, so this second listening socket must be passed between the
processes, since closing and reopening it would risk losing an abort
message.  What a pain.  Count the packet round trips, multiply by the
number of nodes, and figure out how well this all will scale.
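
As a back-of-the-envelope illustration of that multiplication, the sketch
below counts only the two connections per process described above.  The
job size is a made-up example, and a real count would also include the
packet round trips within each connection.

```shell
# Hypothetical job size.
NPROCS=256

# Per the description above, each process connects back twice: once to
# report its Myrinet board and port, once to fetch the mapping.
CONNECTS_PER_PROC=2

TOTAL_CONNECTS=$((NPROCS * CONNECTS_PER_PROC))
echo "$NPROCS processes -> $TOTAL_CONNECTS connections into mpiexec"
```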


LAM notes
---------
Please see README.lam.  It will point out that your patched LAM will use
mpiexec to talk to PBS, but that you still use lamboot and lamrun in your
batch script as usual.


TODO
----
Allocate a TTY so that readline works for stdin operations, if
requested.

Consider integrating some debugging facility, like gdb with multiple
threads.  This probably needs the TTY allocation to work.


# vim: set tw=75 :
