HiveBrain v1.2.0
Get Started
← Back to all entries
patterncMinor

Portability and "dark corner" gotchas in this "watchdog" program

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
thiscornergotchasprogramdarkandportabilitywatchdog

Problem

The following program is intended to "watchdog" a child process, and all subprocesses the child spawns.

The intended behavior is:

  • A child process (as specified on the command line) is run in a dedicated process group.



  • If that process exits, or if it stops producing heartbeat output, the entire process group is killed off and restarted.



  • But if the child exits successfully and has just printed a particular sentinel string, the job is done and the watchdog should just exit.



  • If the watchdog process receives certain fatal signals (TERM, INT, QUIT) it is to pass them along to the monitored process group and then exit itself.



I have endeavored to code this as defensively as possible, but I could use a second opinion particularly on the following aspects:

-
Is this use of process groups safe and reliable? (This program will not be run as a background job in an interactive shell; however, it may be run from a noninteractive shell script with no controlling terminal.)

-
Note the comment immediately above the pair of calls to pselect. Is my workaround actually reliable on all systems?

-
Is the signal handling sufficiently defensive? How about the process creation code?

-
Have I neglected any portability issues? (Assume POSIX.1-2001 + strsignal.)

```
/*
* Usage: watchdog worker args ...
* Runs 'worker' and all its subprocesses in a dedicated process group.
* If 'worker' exits, or if it produces no output (on either stdout or
* stderr) for five minutes, the entire process group is killed off and
* restarted.
*
* We consider 'worker' to be done with its work, and do not restart
it, if it exits successfully and* the last six characters of its
* output are exactly "\nDONE\n".
*/

/* The most exotic features this program requires are pselect(), which
is in POSIX.1-2001, and strsignal(), which is only in POSIX.1-2008 (!) */
#define _POSIX_C_SOURCE 200809L

#include
#include
#include
#include
#include
#include
#include
#include
#i

Solution

I am not familiar enough with process control to find any of the problems you seek, so treat this as a general review, for what it is worth, starting from the top:

I don't know what advantage your IGNORE_ERRORS macro has over a simple
(void) cast for an ignored return value.

I imagine you have a good reason for using _exit in child_startup_err
rather than the normal exit, but it escapes me at the moment.

Your child_startup errors are logged to output_fd rather than to stderr.
I'm not sure if this is an issue but maybe I'll come back to it...

Why do you set opipe[1] and devnull_fd to close-on-exec. These are
created and duped for the child stdio descriptors, so why would you want them
closed on exec of the child process? And is there an advantage in assigning /dev/null to stdin instead of just closing it?

child_spawn could benefit from a goto for failure handling. Also I don't
see the point of the synchronization pipe between parent and child. It is
clever, but what does it achieve? Your parent times-out if it fails to get a
heartbeat every 5 minutes but there's little risk that the child will not have
started up in that time.

In exit_on_signal, which re-raise the signal instead of just exiting?

The endless loop in process_output:

for (;;) {
    count = read(fd, block, sizeof block);
    if (count == 0)
        break; /* EOF */


might be better expressed as:

while ((count = read(fd, block, sizeof block)) != 0) {


The subordinate loop:

for (p = block; p < block + count; ) {
        n = fwrite(p, 1, count - (p - block), stdout);
        if (n == 0) {
            perror("fwrite");
            break;
        }
        p += n;
    }


looks unnecessary. You are using buffered stdio so a single fwrite should
suffice. And if the write fails to write all bytes requested there was an
error (not just if it returns 0).

The embedded constant 6 in the function might be better replaced by a #define

In main:

It is rather big. It seems like some functionality could be extracted to
functions. You have loops nested 3 deep...

Many variables could be defined at or nearer to their place of first use.

I'd rather see all memcpys use the size of the target buffer
rather than the size of the type.

Could the condition you are guarding against with the two pselect calls
occur between the calls?

When a SIGCHILD signal has occurred, why loop calling waitpid with WNOHANG
instead of just waiting in waitpid for the termination to conclude? Does
this really lessen the chance of missing an exit - does that possibility
really exist?

Also I don't see the advantage in the 'sentinel' sent by the child as opposed
to it just exiting cleanly (EXIT_SUCCESS). The sentinel just complicates the
code. And why use 127 for failure and not EXIT_FAILURE?

You are checking the child pid against the expected pid, but you only have one
child, so how would you get the wrong pid? Is there a way the pid can be
wrong (do children of the child get propagated?)?

Code Snippets

for (;;) {
    count = read(fd, block, sizeof block);
    if (count == 0)
        break; /* EOF */
while ((count = read(fd, block, sizeof block)) != 0) {
for (p = block; p < block + count; ) {
        n = fwrite(p, 1, count - (p - block), stdout);
        if (n == 0) {
            perror("fwrite");
            break;
        }
        p += n;
    }

Context

StackExchange Code Review Q#31252, answer score: 5

Revisions (0)

No revisions yet.