Debugging STDIN issues with MPI_Comm_Spawn on MPICH

At work we encountered a strange issue where reading from STDIN in a process that calls MPI_Comm_Spawn caused the program to crash. The solution was non-obvious, so I thought I’d share my fix here for other MPICH users.


Recently, at my day job I’ve been onboarding onto a project that introduces a new mode of execution using MPI_Comm_Spawn to run a program in parallel. MPI_Comm_Spawn spawns an MPI process from an MPI process, creating two MPI process groups, and then creates a communicator (called an intercommunicator) that allows you to send messages between the two groups. For our use case, we wanted users to be able to write and run a script without using mpirun/mpiexec, and then use MPI_Comm_Spawn to create a group of child processes (which could be distributed) to do the heavy lifting. We use MPICH as our MPI implementation on macOS.
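
To make this concrete, here is a minimal mpi4py sketch of spawning a worker and exchanging a message over the resulting intercommunicator. (This is an illustration, not our actual code; the file names and message payloads are hypothetical.)

# parent.py: spawn one worker and talk to it over the intercommunicator.
from mpi4py import MPI

intercomm = MPI.COMM_WORLD.Spawn("python", ["worker.py"], maxprocs=1)
intercomm.send({"task": "ping"}, dest=0)  # dest is a rank in the *remote* group
print(intercomm.recv(source=0))           # the worker's reply
intercomm.Disconnect()

# worker.py: the spawned side reaches its parent via Get_parent().
from mpi4py import MPI

intercomm = MPI.Comm.Get_parent()
task = intercomm.recv(source=0)           # message from the parent
intercomm.send({"reply": "pong"}, dest=0)
intercomm.Disconnect()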

We noticed that when running our code in an interactive shell (e.g. ipython), we would see errors like the following and then the shell would crash:

[proxy:0:0@0HOSTNAME] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[proxy:0:0@0HOSTNAME] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:521): unable to write to downstream stdin
[proxy:0:0@0HOSTNAME] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@0HOSTNAME] main (proxy/pmip.c:127): demux engine error waiting for event
fish: Job 1, 'ipython' terminated by signal SIGKILL (Forced quit)

One workaround was to run mpiexec -n 1 ipython, but this wasn’t acceptable, as it made interactivity poor (MPI doesn’t handle interactive ioctls well). To solve this issue, we had to close STDIN right before spawning the child processes. See the end of this post for a code example.

I wanted to write this blog post since googling around didn’t help, and I ended up having to read the MPICH source code to understand how Hydra (the launcher used by MPICH) handles IO events.

Debugging

At first, we didn’t understand why it was crashing. I tried a bunch of things to get more information: sleeps in the parent, sleeps in the child, logging everywhere. We didn’t know whether the code we ran in the spawned process was incorrect, or whether the parent process had some bug. Mysteriously, the crash would go away if we ran the parent process with mpiexec -n 1. Eventually, I noticed that after the spawn operation, if I spammed the shell with the enter key, it would crash before doing any other work. This led me to believe that the issue was reading from STDIN. From there, I looked at the MPICH source code, where I found the following:

HYD_status HYD_pmcd_pmip_control_cmd_cb(int fd, HYD_event_t events, void *userp)
{
    ...
    } else if (hdr.cmd == CMD_STDIN) {
            ...
            status = HYDU_sock_write(p->in, buf, hdr.buflen, &count, &closed, HYDU_SOCK_COMM_NONE);
            HYDU_ERR_POP(status, "unable to write to downstream stdin\n");
            ...

This seemed to match the error message above (though the line numbers were different). It was coming from an STDIN event! This aligned well with the hypothesis that the crash was happening when STDIN was read, but only when the parent process was run without mpiexec. So how can we prevent STDIN from being read? As far as I can tell, Hydra does not provide a mechanism to disable reading STDIN. We looked through the code for a while, and eventually found HYDT_dmxu_poll_stdin_valid, which checks whether the stdin file descriptor supports poll (we’re using the default poll demux implementation). If this function marks STDIN as invalid, then Hydra will not listen for events from STDIN.
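
To see what "supports poll" means in practice, here is a small standalone sketch (my own illustration, not MPICH code) showing that poll() reports POLLNVAL for a closed file descriptor, which is the kind of invalidity a check like this can detect:

import os
import select

# Demonstration: once fd 0 is closed, poll() flags it with POLLNVAL,
# i.e. it no longer "supports poll".
saved = os.dup(0)           # keep a copy so we can restore stdin afterwards
os.close(0)                 # fd 0 is now closed

poller = select.poll()
poller.register(0, select.POLLIN)
print(poller.poll(0))       # e.g. [(0, 32)] where 32 == select.POLLNVAL

os.dup2(saved, 0)           # restore stdin
os.close(saved)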

Reproducer/Fix

The following script (when saved as repro.py) will reproduce the issue. Note that you will need to use this version of mpi4py.

import os
import sys

from mpi4py import MPI


def parent():
    num_procs = 1

    errcodes = [0] * num_procs
    comm_world = MPI.COMM_WORLD
    # Spawn num_procs children running "python repro.py child"; this returns
    # the intercommunicator connecting the parent group to the child group.
    intercomm = comm_world.Spawn(
        "python",
        ["repro.py", "child"],
        num_procs,
        MPI.INFO_NULL,
        0,
        errcodes
    )
    # Synchronize with the children over the intercommunicator.
    intercomm.barrier()


def child():
    print("Hello from child")
    comm_world = MPI.COMM_WORLD
    # Synchronize within the child group first...
    comm_world.barrier()

    # ...then with the parent group via the intercommunicator.
    intercomm = comm_world.Get_parent()
    intercomm.barrier()


if __name__ == "__main__":
    if len(sys.argv) >= 2 and sys.argv[1] == "child":
        child()
    else:
        parent()
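
To reproduce: launch ipython, run the script (e.g. with %run repro.py), and once the spawn completes, press Enter a few times. In our environment this reliably produced the errors shown above and killed the shell.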

The fix was to do the following:

...
import contextlib


@contextlib.contextmanager
def no_stdin():
    """Temporarily close stdin and execute a block of code"""
    # Save a reference to the original stdin
    stdin_dup = os.dup(0)
    # Close stdin so Hydra sees an invalid fd 0
    os.close(0)
    try:
        yield
    finally:
        # Restore the saved fd and release the duplicate
        os.dup2(stdin_dup, 0)
        os.close(stdin_dup)


def parent():
    ...
    with no_stdin():
        intercomm = comm_world.Spawn(
            "python",
            ["repro.py", "child"],
            num_procs,
            MPI.INFO_NULL,
            0,
            errcodes
        )
    ...
...
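
The dup/dup2 dance is what keeps this usable interactively: fd 0 is only closed for the brief window in which the Spawn call runs (presumably when Hydra's stdin-validity check happens), and the original descriptor is restored as soon as the children have been launched.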

Is this the right fix?

I wasn’t satisfied with this. It felt rather hacky, and I felt like the appropriate fix would be a level deeper, as an option one could pass to MPICH. So I forked the repo and started adding a new flag to close STDIN in the child processes. Along the way, I learned the following:

  • The hydra_pmi_proxy process needs a valid STDIN. I don’t know exactly why, but it seems to use it to send control messages to the spawning MPI process.
  • When MPI_Comm_Spawn is called from a non-MPI process, it first creates an MPI child process, which then spawns the new group. If the caller is already an MPI process, this step can be skipped.
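
As an aside, you can inspect what a spawned child's STDIN actually is with a quick diagnostic like this (a hypothetical check, not part of our fix); in our runs it reports a pipe, not the parent's terminal:

import os
import stat

# Run inside a spawned child: inspect what fd 0 really is.
mode = os.fstat(0).st_mode
print("stdin is a tty: ", os.isatty(0))
print("stdin is a pipe:", stat.S_ISFIFO(mode))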

Putting this together - we actually do want the child processes to have valid STDINs, but this STDIN is not connected to the STDIN of the parent process at all - it’s always a pipe. Instead, we need to ensure that if an intermediate mpiexec process is spawned, it has an invalid STDIN. With that knowledge, I feel more confident about the fix above: I think it is the right fix, and it is in the right place. It might be nice to have an option for MPI_Comm_Spawn to close STDIN when it creates an mpiexec process, but I think such an option would likely be more confusing and difficult to use correctly in the common case. Anyway, this was a fun adventure, and I definitely gained a deeper understanding of MPICH’s internals! Hopefully I’ll have more opportunities to play around with this code in the future.

Written on September 27, 2024