patterncMinor
Portability and "dark corner" gotchas in this "watchdog" program
Viewed 0 times
thiscornergotchasprogramdarkandportabilitywatchdog
Problem
The following program is intended to "watchdog" a child process, and all subprocesses the child spawns.
The intended behavior is:
I have endeavored to code this as defensively as possible, but I could use a second opinion particularly on the following aspects:
-
Is this use of process groups safe and reliable? (This program will not be run as a background job in an interactive shell; however, it may be run from a noninteractive shell script with no controlling terminal.)
-
Note the comment immediately above the pair of calls to
-
Is the signal handling sufficiently defensive? How about the process creation code?
-
Have I neglected any portability issues? (Assume POSIX.1-2001 +
```
/*
* Usage: watchdog worker args ...
* Runs 'worker' and all its subprocesses in a dedicated process group.
* If 'worker' exits, or if it produces no output (on either stdout or
* stderr) for five minutes, the entire process group is killed off and
* restarted.
*
* We consider 'worker' to be done with its work, and do not restart
it, if it exits successfully and* the last six characters of its
* output are exactly "\nDONE\n".
*/
/* The most exotic features this program requires are pselect(), which
is in POSIX.1-2001, and strsignal(), which is only in POSIX.1-2008 (!) */
#define _POSIX_C_SOURCE 200809L
#include
#include
#include
#include
#include
#include
#include
#include
#i
The intended behavior is:
- A child process (as specified on the command line) is run in a dedicated process group.
- If that process exits, or if it stops producing heartbeat output, the entire process group is killed off and restarted.
- But if the child exits successfully and has just printed a particular sentinel string, the job is done and the watchdog should just exit.
- If the watchdog process receives certain fatal signals (TERM, INT, QUIT) it is to pass them along to the monitored process group and then exit itself.
I have endeavored to code this as defensively as possible, but I could use a second opinion particularly on the following aspects:
-
Is this use of process groups safe and reliable? (This program will not be run as a background job in an interactive shell; however, it may be run from a noninteractive shell script with no controlling terminal.)
-
Note the comment immediately above the pair of calls to
pselect. Is my workaround actually reliable on all systems?-
Is the signal handling sufficiently defensive? How about the process creation code?
-
Have I neglected any portability issues? (Assume POSIX.1-2001 +
strsignal.)```
/*
* Usage: watchdog worker args ...
* Runs 'worker' and all its subprocesses in a dedicated process group.
* If 'worker' exits, or if it produces no output (on either stdout or
* stderr) for five minutes, the entire process group is killed off and
* restarted.
*
* We consider 'worker' to be done with its work, and do not restart
it, if it exits successfully and* the last six characters of its
* output are exactly "\nDONE\n".
*/
/* The most exotic features this program requires are pselect(), which
is in POSIX.1-2001, and strsignal(), which is only in POSIX.1-2008 (!) */
#define _POSIX_C_SOURCE 200809L
#include
#include
#include
#include
#include
#include
#include
#include
#i
Solution
I am not familiar enough with process control to find any of the problems you seek, so treat this as a general review, for what it is worth, starting from the top:
I don't know what advantage your
I imagine you have a good reason for using
rather than the normal
Your
I'm not sure if this is an issue but maybe I'll come back to it...
Why do you set
created and duped for the child stdio descriptors, so why would you want them
closed on exec of the child process? And is there an advantage in assigning
see the point of the synchronization pipe between parent and child. It is
clever, but what does it achieve? Your parent times-out if it fails to get a
heartbeat every 5 minutes but there's little risk that the child will not have
started up in that time.
In
The endless loop in
might be better expressed as:
The subordinate loop:
looks unnecessary. You are using buffered stdio so a single
suffice. And if the write fails to write all bytes requested there was an
error (not just if it returns 0).
The embedded constant 6 in the function might be better replaced by a #define
In
It is rather big. It seems like some functionality could be extracted to
functions. You have loops nested 3 deep...
Many variables could be defined at or nearer to their place of first use.
I'd rather see all
rather than the size of the type.
Could the condition you are guarding against with the two
occur between the calls?
When a
instead of just waiting in
this really lessen the chance of missing an exit - does that possibility
really exist?
Also I don't see the advantage in the 'sentinel' sent by the child as opposed
to it just exiting cleanly (
code. And why use 127 for failure and not
You are checking the child pid against the expected pid, but you only have one
child, so how would you get the wrong pid? Is there a way the pid can be
wrong (do children of the child get propagated?)?
I don't know what advantage your
IGNORE_ERRORS macro has over a simple(void) cast for an ignored return value.I imagine you have a good reason for using
_exit in child_startup_errrather than the normal
exit, but it escapes me at the moment.Your
child_startup errors are logged to output_fd rather than to stderr.I'm not sure if this is an issue but maybe I'll come back to it...
Why do you set
opipe[1] and devnull_fd to close-on-exec. These arecreated and duped for the child stdio descriptors, so why would you want them
closed on exec of the child process? And is there an advantage in assigning
/dev/null to stdin instead of just closing it?child_spawn could benefit from a goto for failure handling. Also I don'tsee the point of the synchronization pipe between parent and child. It is
clever, but what does it achieve? Your parent times-out if it fails to get a
heartbeat every 5 minutes but there's little risk that the child will not have
started up in that time.
In
exit_on_signal, which re-raise the signal instead of just exiting?The endless loop in
process_output:for (;;) {
count = read(fd, block, sizeof block);
if (count == 0)
break; /* EOF */might be better expressed as:
while ((count = read(fd, block, sizeof block)) != 0) {The subordinate loop:
for (p = block; p < block + count; ) {
n = fwrite(p, 1, count - (p - block), stdout);
if (n == 0) {
perror("fwrite");
break;
}
p += n;
}looks unnecessary. You are using buffered stdio so a single
fwrite shouldsuffice. And if the write fails to write all bytes requested there was an
error (not just if it returns 0).
The embedded constant 6 in the function might be better replaced by a #define
In
main:It is rather big. It seems like some functionality could be extracted to
functions. You have loops nested 3 deep...
Many variables could be defined at or nearer to their place of first use.
I'd rather see all
memcpys use the size of the target bufferrather than the size of the type.
Could the condition you are guarding against with the two
pselect callsoccur between the calls?
When a
SIGCHILD signal has occurred, why loop calling waitpid with WNOHANGinstead of just waiting in
waitpid for the termination to conclude? Doesthis really lessen the chance of missing an exit - does that possibility
really exist?
Also I don't see the advantage in the 'sentinel' sent by the child as opposed
to it just exiting cleanly (
EXIT_SUCCESS). The sentinel just complicates thecode. And why use 127 for failure and not
EXIT_FAILURE?You are checking the child pid against the expected pid, but you only have one
child, so how would you get the wrong pid? Is there a way the pid can be
wrong (do children of the child get propagated?)?
Code Snippets
for (;;) {
count = read(fd, block, sizeof block);
if (count == 0)
break; /* EOF */while ((count = read(fd, block, sizeof block)) != 0) {for (p = block; p < block + count; ) {
n = fwrite(p, 1, count - (p - block), stdout);
if (n == 0) {
perror("fwrite");
break;
}
p += n;
}Context
StackExchange Code Review Q#31252, answer score: 5
Revisions (0)
No revisions yet.