Why do we rely on computers in critical fields?
Problem
I assume that computers make many mistakes (errors, bugs, glitches, etc.), as suggested by the number of questions asked every day on communities like Stack Overflow by people trying to fix such issues.
If computers really make many errors (as I assumed earlier) then critical tasks (like signing in or receiving a receipt) must be designed to be almost error-free, unlike most of the tasks of most software and video games.
Solution
Computer hardware almost always does exactly what software tells it to do.
It's useful to distinguish software bugs from unreliable hardware.
Cosmic rays can randomly flip a bit in memory, though; that's why servers often use ECC (Error Correction Code) memory to correct single-bit errors and detect most multi-bit errors. (And internally, CPUs usually use ECC for their caches.)
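To make the single-bit correction idea concrete, here is a toy sketch using a Hamming(7,4) code. It is an illustration only: real ECC memory typically protects much wider words (for example 64 data bits plus 8 check bits) and adds an extra parity bit so that double-bit errors are detected rather than miscorrected. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6 and 7.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the flipped position.

    This toy code assumes at most one flipped bit. A double flip would be
    silently miscorrected, which is why real ECC adds another parity bit.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[4] ^= 1  # simulate a cosmic-ray bit flip at position 5
    recovered, position = hamming74_decode(stored)
    print(recovered == word, position)
```

The syndrome is simply the binary position of the flipped bit, so correcting it needs no search: the memory controller can fix the error on every read.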
Computers that need even more reliability and availability than that (see Footnote 1), like flight computers in aircraft or spacecraft, often use Triple Modular Redundancy: 3 separate computers processing the same inputs. If all 3 produce the same output, great, it's almost certainly correct. (Especially if each of the 3 computers is running software written by different teams.) If only 2 out of the 3 outputs match, the odd one out is assumed wrong, so it gets reset, and the system uses the outputs of the remaining two until the faulty one has rebooted and agrees with them again. If all 3 systems give different outputs, you have a big problem. If 2 systems give the same wrong answer, that's even worse, because the vote picks the wrong answer without noticing.
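The voting step described above can be sketched as a small majority voter. This is a minimal illustration, not how any particular flight computer implements it; the `vote` function and its return shape are assumptions made for the example, and outputs are assumed to be directly comparable, hashable values.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote over the outputs of three redundant channels.

    Returns (agreed_value, faulty_indices). Raises RuntimeError when no
    two channels agree, since there is then no majority to trust.
    """
    if len(outputs) != 3:
        raise ValueError("expected exactly three channel outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three channels disagree")
    # Any channel that differs from the majority is flagged for a reset.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


if __name__ == "__main__":
    print(vote([42, 42, 42]))  # all channels agree, nothing to reset
    print(vote([42, 17, 42]))  # channel 1 is outvoted and flagged
```

Note that the voter has no way to tell a correct majority from two channels that agree on the same wrong value, which is why dissimilar hardware and software are used to make that case unlikely.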
Safety-critical systems like those are programmed much more carefully than video games or even mainstream OSes. Practices like avoiding dynamic memory allocation (malloc) remove whole classes of bugs and possible corner cases.
Footnote 1: detecting an error and rebooting is not sufficient when the system is part of the flight controls of a jet plane that could crash if the controls stopped responding for half a second.
Related:
- How do redundancies work in aircraft systems?
- How dissimilar are redundant flight control computers? - not only do they have redundant systems, they're often built from different hardware running different software. So any power glitch or other weird thing is likely to have different effects on them, hopefully avoiding the case of multiple wrong answers out-voting a correct answer.
Context
StackExchange Computer Science Q#153158, answer score: 26