BASH script to monitor subprocess and throttle it for CPU temperature control

Submitted by: @import:stackexchange-codereview

Problem

I need to run CPU-intensive tasks on a very old machine with overheating issues. The script below will monitor temperature and pause the jobs when it gets too high, continuing them when it's back to normal.
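
The pause-and-resume rule described here amounts to a hysteresis band: stop at a high threshold and resume only once the temperature has fallen to a lower one. The following is a minimal sketch of just that decision, separated from any sensor reading. The function name `throttle_action` is a hypothetical helper, and the thresholds simply mirror the 60/95 values that appear in the scripts on this page.

```shell
#!/bin/bash
# Sketch of a two-threshold (hysteresis) decision.
# throttle_action is a hypothetical helper name, not part of the original script.
temp_lo=60   # resume at or below this temperature (°C)
temp_hi=95   # pause at or above this temperature (°C)

# Print the action for a whole-degree temperature reading:
#   stop = pause the job, cont = resume it, none = leave it as it is.
throttle_action() {
    local temp=$1
    if (( temp >= temp_hi )); then
        echo stop
    elif (( temp <= temp_lo )); then
        echo cont
    else
        echo none
    fi
}

for t in 45 70 95 80 60; do
    echo "$t -> $(throttle_action "$t")"
done
```

Between the two thresholds nothing changes, so a job paused at 95 stays paused while the CPU cools through 80 and only resumes at 60, which avoids rapid stop/start flapping around a single limit.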

The actual commands run are, of course, not included, since they are irrelevant to the question.

I am looking for hidden traps I may have set in my code (listed below), and for anything else I have done incorrectly. Apart from special characters in the commands and arguments being run, which are written by hand so I can control that risk, what traps or gotchas have I unknowingly built into the code? How could it be made more error-proof, or better in other ways?

For the timing function I know I could have used the

time { command ...; command ...; }


construct, but I was more interested in the time spent by the machine (and previously, by me in the chair) than in the CPU time involved.

The Script:

The code comments should explain what it does, as well as why I did some of it the way I did.

```
#!/bin/bash

# Build my time reporting function
function report {
# Get the current time, do the math, report the results.
end_time=$(date +%s);

# The time used for the last run process
proc_time=$(echo "$end_time"-"$start_time" | bc);
echo " *** Processing time: $(date -u -d @${proc_time} +%T)";

# The cumulative time for all processes so far
run_time=$(echo "$end_time"-"$launch_time" | bc);
echo " *** Running time: $(date -u -d @${run_time} +%T)";
}

# The high and low temperatures to monitor for. Processing is paused
# once the high temp is reached, and will not resume again until the
# low temp is reached.

# My system recovers to 60°C reasonably quick (idle is around 45°C)
temp_lo=60;

# My system dies at about 115°C - since 100°C is normal, suggests my
# sensors are not accurate, but I work with what I have.
# 20°C margin allows for delay in the detection of the high temp, and
# delay in
```

Solution

Although unrelated to the code, I'll mention that it is unusual for a CPU to overheat, especially a dual-core one, except at very high ambient temperatures. I'd suggest removing the heat sink and re-applying thermal paste; any number of YouTube videos provide step-by-step instructions.

Moving on to the code:

  • terminal semicolons aren't needed
  • configuration should go at the top
  • kill -0 PID is a portable alternative to testing -e /proc/$pid
  • bash's let and [[ x -gt y ]] can replace bc for these purposes
  • [[ .. ]] is a safer bash alternative to [ .. ], since unquoted variables inside it are not word-split
  • date +%s can be replaced by the builtin printf with the %(%s)T format (bash 4.2 or later)
  • gawk can extract the temperature more flexibly than grep+sed
  • your time/run/report pattern can be factored into a function
  • the monitoring loop can be simplified by moving sleep to the end
  • there's no real harm in monitoring more aggressively, since the loop is not going to use a lot of CPU
  • you can save a couple of forks by reading the temperature directly from /sys, at the cost of CPU-vendor specificity
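
Two of the points above, arithmetic without bc and timestamps without date, can be sketched as follows. This assumes bash 4.2 or later for the `%(...)T` printf format. `fmt_hms` is a hypothetical helper introduced only for this illustration.

```shell
#!/bin/bash
# Sketch: timestamps and integer arithmetic using bash builtins only.
# Assumes bash 4.2+ (for the %(...)T printf format).

# Seconds since the epoch without running the external date program;
# the argument -1 means "the current time".
now() { printf '%(%s)T' -1; }

# Format a duration in seconds as HH:MM:SS with shell arithmetic.
# (hypothetical helper, not part of the original script)
fmt_hms() {
    local s=$1
    printf '%02d:%02d:%02d' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

start_time=$(now)
# ... the long-running work would happen here ...
end_time=$(now)

# $(( ... )) replaces the echo ... | bc pipeline for integer math.
echo "Processing time: $(fmt_hms $(( end_time - start_time )))"
```

Unlike `date -u -d @N +%T`, this formatter does not wrap at 24 hours, so a 25-hour run prints as 25:00:00 rather than 01:00:00.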



Putting it all together:

#!/bin/bash
temp_lo=60
temp_hi=95

# reading temps from /sys is CPU-vendor-specific, eliminates need for external sensors program
# temp_label=$( grep -l ^Core /sys/bus/platform/devices/coretemp.*/hwmon/hwmon*/temp*_label  |head -1 )
# temp_source=${temp_label%_label}_input

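# a function rather than an alias: bash does not expand aliases in non-interactive scripts by default
function now { printf '%(%s)T\n' -1; }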

function watch_child {
    childd=$1
    while kill -0 $childd >& /dev/null; do
        temp=$( sensors | gawk -F'[: +°.]+' '/^Core.?1/ {print $3;exit}' )
        # temp=$(( $(<$temp_source) / 1000 ))
        [[ $temp -ge $temp_hi ]] && kill -SIGSTOP $childd
        [[ $temp -le $temp_lo ]] && kill -SIGCONT $childd
        sleep 1
    done
}

function elapsed {
    echo " ******* $1 time: $(date -u -d @$(( ${3:-$(now)}-$2 )) +%T)"
}

function monitor {
    launch_time=${launch_time:-$(now)}
    start_time=$(now)
    echo "********* $1"
    shift
    "$@" &
    watch_child $!
    elapsed Processing $start_time
    elapsed Running $launch_time
}

monitor "The step to perform." my_long_running_command arg1 arg2 
monitor "The next step to perform." another_long_running_command arg1 arg2
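
The signal handling that watch_child relies on can be tried in isolation. The sketch below uses `sleep 30` as a stand-in for a real long-running job (an assumption made purely for the demonstration) and needs no temperature sensor.

```shell
#!/bin/bash
# Sketch: pause, resume, and liveness-check a background child with signals.
# "sleep 30" stands in for a real long-running job (assumption for this demo).
sleep 30 &
child=$!

# kill -0 delivers no signal; it only reports whether the PID can be signalled.
kill -0 "$child" 2>/dev/null && echo "child $child is running"

kill -STOP "$child"    # suspend the job (SIGSTOP cannot be caught or ignored)
kill -0 "$child" 2>/dev/null && echo "child still exists while stopped"

kill -CONT "$child"    # let it run again
kill -TERM "$child"    # end the demo job early
wait "$child" 2>/dev/null || true   # reap it; non-zero status is expected here

kill -0 "$child" 2>/dev/null || echo "child is gone"
```

Because a stopped process still passes the `kill -0` check, a monitoring loop built on it keeps running while the job is paused and is still there to send SIGCONT once the temperature drops.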

Context

StackExchange Code Review Q#152320, answer score: 4
