BASH script to monitor subprocess and throttle it for CPU temperature control
Problem
I need to run CPU-intensive tasks on a very old machine with overheating issues. The script below will monitor temperature and pause the jobs when it gets too high, continuing them when it's back to normal.
The actual commands run are, of course, not included, since they are irrelevant to the question.
I am looking for hidden traps I may have set in my code (listed at the bottom), and for other things I have done incorrectly. Aside from special characters in the commands and arguments that are run, which are hand-created so I can control that risk, what traps or gotchas have I unknowingly built into the code? How could this be made more error-proof, or better in other ways?
For the timing function I know I could have used the `time { command ...; command ...; }` construct, but I was more interested in the time spent by the machine (and previously, by me in the chair) than in the CPU time involved.
The Script:
The code comments should explain what it does, as well as why I did some of it the way I did.
```
#!/bin/bash
# Build my time reporting function
function report {
    # Get the current time, do the math, report the results.
    end_time=$(date +%s);
    # The time used for the last run process
    proc_time=$(echo "$end_time"-"$start_time" | bc);
    echo " *** Processing time: $(date -u -d @${proc_time} +%T)";
    # The cumulative time for all processes so far
    run_time=$(echo "$end_time"-"$launch_time" | bc);
    echo " *** Running time: $(date -u -d @${run_time} +%T)";
}
# The high and low temperatures to monitor for. Processing is paused
# once the high temp is reached, and will not resume again until the
# low temp is reached.
# My system recovers to 60°C reasonably quick (idle is around 45°C)
temp_lo=60;
# My system dies at about 115°C - since 100°C is normal, suggests my
# sensors are not accurate, but I work with what I have.
# 20°C margin allows for delay in the detection of the high temp, and
# delay in
```
Solution
Although unrelated to the code, I'll mention that it is unusual for a CPU, especially a dual-core CPU, to overheat except at very high ambient temperatures. I'd suggest removing the heat sink and re-applying thermal paste. Any number of YouTube videos can provide step-by-step instructions.
Moving on to the code:
- terminal semicolons aren't needed
- configuration should go at the top
- `kill -0 PID` is a portable alternative to `-e /proc/$pid`
- bash builtins
  - `let` and `[[ x -gt y ]]` can replace `bc` for these purposes
  - `[[ .. ]]` is a bash keyword alternative to `[ .. ]` (which is itself a builtin in bash), and it does not word-split unquoted variables
  - `date +%s` can be replaced by the builtin `printf` (the `%(...)T` format needs bash 4.2 or later)
- `gawk` can extract the temperature more flexibly than `grep`+`sed`
- your time/run/report pattern can be factored into a function
- the monitoring loop can be simplified by moving `sleep` to the end
- no real harm in monitoring more aggressively, since the loop is not going to use a lot of CPU
- can save a couple of forks by reading temp directly from /sys, at the cost of CPU-vendor specificity
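As a rough sketch of the builtin replacements listed above, assuming bash 4.2 or later: the helper names `now_into` and `is_alive` are hypothetical and not part of the original script, and the 3725-second offset is made up so the example runs instantly.

```bash
#!/bin/bash
# Sketch only: the helper names below are hypothetical, not from the original script.

# Epoch seconds via the printf builtin (bash 4.2+); -v stores the result
# in the named variable, so not even a subshell is needed.
now_into() { printf -v "$1" '%(%s)T' -1; }

# kill -0 sends no signal; it only reports whether the PID can be signalled.
is_alive() { kill -0 "$1" 2>/dev/null; }

now_into start_time
end_time=$(( start_time + 3725 ))        # pretend 1 h 2 min 5 s have passed

# Shell arithmetic replaces the "echo a-b | bc" pipeline.
proc_time=$(( end_time - start_time ))

# [[ ... -gt ... ]] compares integers without calling out to bc.
if [[ $proc_time -gt 3600 ]]; then
    echo "took more than an hour: ${proc_time}s"
fi

# The current shell is alive; a child that has been waited for is not.
sleep 0 &
child=$!
wait "$child"
is_alive "$$" && echo "shell $$ is alive"
is_alive "$child" || echo "child has exited"
```

Using `printf -v` instead of `$(date +%s)` avoids both the fork and the exec, which matters little here but keeps the monitoring loop itself as light as possible.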
Putting it all together:
```
#!/bin/bash
temp_lo=60
temp_hi=95

# reading temps from /sys is CPU-vendor-specific, eliminates need for external sensors program
# temp_label=$( grep -l ^Core /sys/bus/platform/devices/coretemp.*/hwmon/hwmon*/temp*_label | head -1 )
# temp_source=${temp_label%_label}_input

# a function rather than an alias: bash does not expand aliases in
# non-interactive shells by default (the %(...)T format needs bash 4.2+)
function now {
    printf '%(%s)T\n' -1
}

function watch_child {
    local child=$1
    # loop for as long as the child process still exists
    while kill -0 "$child" >& /dev/null; do
        temp=$( sensors | gawk -F'[: +°.]+' '/^Core.?1/ {print $3; exit}' )
        # temp=$(( $(<$temp_source) / 1000 ))
        [[ $temp -ge $temp_hi ]] && kill -SIGSTOP "$child"
        [[ $temp -le $temp_lo ]] && kill -SIGCONT "$child"
        sleep 1
    done
}

function elapsed {
    echo " ******* $1 time: $(date -u -d @$(( ${3:-$(now)} - $2 )) +%T)"
}

function monitor {
    launch_time=${launch_time:-$(now)}
    start_time=$(now)
    echo "********* $1"
    shift
    "$@" &
    watch_child $!
    elapsed Processing "$start_time"
    elapsed Running "$launch_time"
}

monitor "The step to perform." my_long_running_command arg1 arg2
monitor "The next step to perform." another_long_running_command arg1 arg2
```
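The pause/resume thresholds and the commented-out /sys variant can be exercised without real sensors. The following is a minimal sketch under stated assumptions: `throttle_action` and `read_millidegrees` are hypothetical helpers that do not appear in the script above, and a temporary file stands in for a hwmon `temp*_input` file, which holds millidegrees Celsius (hence the division by 1000).

```bash
#!/bin/bash
# Sketch only: both helpers are hypothetical illustrations.

# Decide which signal (if any) the watcher would send.
# STOP at or above the high mark, CONT at or below the low mark,
# NONE in between, so a paused job stays paused until it has cooled.
throttle_action() {
    local temp=$1 lo=$2 hi=$3
    if (( temp >= hi )); then
        echo STOP
    elif (( temp <= lo )); then
        echo CONT
    else
        echo NONE
    fi
}

# Read a file holding millidegrees Celsius and print whole degrees.
read_millidegrees() {
    local raw
    raw=$(<"$1") || return 1
    echo $(( raw / 1000 ))
}

# Demo: a temporary file stands in for a temp*_input file.
demo_file=$(mktemp)
echo 97000 > "$demo_file"
temp=$(read_millidegrees "$demo_file")
echo "temp=$temp action=$(throttle_action "$temp" 60 95)"
rm -f "$demo_file"
```

Keeping the decision in its own function makes the gap between the two thresholds easy to check: any reading strictly between them changes nothing, which is what stops the job from flapping between paused and running.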
Context
StackExchange Code Review Q#152320, answer score: 4