Overclocking and Prime95
- 1 Can I ignore an overclocking problem?
- 2 How long should I run the torture test?
- 3 Prime95 reports errors during the torture test, but other programs do not?
- 4 Should I lower FSB for a higher multiplier?
- 5 What to do if a problem is found during stress testing?
- 6 What type of work should I use to burn-in a box?
- 7 Where can I find a list of OC tools?
- 8 Why is my machine failing the self test?
- 9 Why is Prime95 good for testing an overclocked machine?
- 10 Why should I stress test my overclocked machine?
Can I ignore an overclocking problem?
Ignoring an overclocking problem is a matter of personal preference. There are two schools of thought on this subject:
- Most programs you run will not stress your computer enough to cause a wrong result or system crash. If a few games stress your machine and it crashes, that is no great loss. If you belong to this group, stay away from distributed computing projects, where an incorrect calculation might cause you to return wrong results. You are not helping these projects by returning bad data! In conclusion, if you are comfortable with a small risk of an occasional system crash, feel free to live a little dangerously!
- The second school of thought is, "Why run a stress test if you are going to ignore the results?" You and the project want a guaranteed 100% rock solid machine. Passing these stability tests gives you the ability to run CPU intensive programs with confidence.
Packages that manipulate extremely large integers generally make use of Fast Fourier transforms (FFTs) to perform multiplications. The FFTs are floating-point but the initial and final values of the process must be integer. Because of rounding errors the results won't be exact but the accumulated error must be small enough so that you can identify reliably the integer each value is supposed to be. You might decide that an error of, say, +/-0.1 is acceptable. So if you end up with a value of 1.92 or 2.06 then you know the correct value is 2. If you end up with 2.2 then the calculation is invalid. If you proved that the maximum accumulated error cannot be more than +/-0.1 on a correct calculation then a processing error must have occurred.
What is important to understand is that these calculations must always have a sufficiently small error. A machine that has been overclocked too far may generate unacceptable errors. The difference can actually be very small - but it still is an error.
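The round-off check described above is easy to see in a small sketch. This is a toy Python/NumPy version, not GIMPS's actual code (which uses a far larger base and a hand-tuned FFT); the function name, the base-10 digit representation, and the 0.1 error bound are illustrative assumptions:

```python
import numpy as np

def fft_multiply(a_digits, b_digits, max_err=0.1):
    """Multiply two numbers given as base-10 digit lists (least significant
    digit first) via a floating-point FFT, rejecting the result if the
    accumulated round-off error exceeds max_err -- the same kind of check
    the text describes."""
    n = len(a_digits) + len(b_digits)
    size = 1                       # round up to a power of two for the FFT
    while size < n:
        size *= 2
    fa = np.fft.rfft(a_digits, size)
    fb = np.fft.rfft(b_digits, size)
    raw = np.fft.irfft(fa * fb, size)   # convolution, still floating point
    rounded = np.round(raw)
    err = np.max(np.abs(raw - rounded))  # how far from exact integers?
    if err > max_err:
        raise ArithmeticError(f"round-off error {err:.3f} exceeds {max_err}")
    # Carry propagation turns the convolution back into base-10 digits.
    result, carry = [], 0
    for v in rounded[:n]:
        carry, digit = divmod(int(v) + carry, 10)
        result.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        result.append(digit)
    return result
```

On healthy hardware the `raw` values land very close to integers; a CPU pushed too far produces values like 2.2, and the check above is what catches it.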
Floating point operations inevitably introduce some rounding error. However, it is also clear that some people would rather be fast than right and do not much care how much error slips in. This may be acceptable for gaming, but it is not acceptable for distributed computing projects (or most other applications), and project designers simply have to account for it.
One can make a case that anyone who knowingly runs an overclock that fails the Prime95 torture test is actually cheating and sabotaging whatever project they run, since they are deliberately returning bad results.
Why? Precisely because their computer is turning in wrong answers beyond the expected floating point error rates (even if "a little bit"). The problem is, one more bit off here and there may matter very little or it may change the whole answer. Once they escape any planned bounds, errors, by their nature, have outcomes that are hard to control. Even the low order bit may matter as much as a high order bit. You can't predict the effect of (what amounts to) flipping random bits in the answer.
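To see how unpredictable a flipped bit is, consider this small Python sketch (the helper name is our own) that flips a single bit of a double's IEEE-754 encoding. Flipping the lowest mantissa bit of 2.0 changes it by about 4e-16; flipping one exponent bit turns it into 4.0:

```python
import struct

def flip_bit(x, bit):
    """Return the double x with one bit of its 64-bit IEEE-754 encoding
    flipped (bit 0 = lowest mantissa bit, bits 52-62 = exponent)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

flip_bit(2.0, 0)   # 2.0000000000000004 -- barely noticeable
flip_bit(2.0, 52)  # 4.0 -- the answer doubles
```

The same physical fault (one wrong bit) can thus be harmless noise or a completely different answer, depending purely on where it lands.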
What is the difference, really, between turning in made-up work units that happen to pass the security checks and knowingly turning in incorrectly calculated ones?
No one is forced to check their overclock, but at some point it becomes willful ignorance not to verify the machine's accuracy, for example by running Prime95 or something else with extensive self-checks.
Prime95 operates with very specific bounds on acceptable error and is capable of identifying results that fall outside them. Most projects do not do this, because it is hard. It is easy to send out the same WU two or three times and then see if the results match within a specified floating point tolerance overall. It is much harder to build in "as you go" range checking that actually works. Some problems are hospitable to this sort of thing and others are not.
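The "send the same WU out twice and compare" approach really is the easy part. A minimal sketch, assuming results come back as vectors of floats (the function name and tolerance are our own, not any project's actual validator):

```python
def results_agree(a, b, tol=1e-9):
    """Compare two redundantly computed result vectors element-wise,
    accepting them if every pair matches within a relative floating-point
    tolerance (with an absolute floor of tol for values near zero)."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= tol * max(abs(x), abs(y), 1.0)
               for x, y in zip(a, b))

results_agree([1.0, 2.0], [1.0 + 1e-12, 2.0])  # True: within tolerance
results_agree([1.0, 2.0], [1.1, 2.0])          # False: a real mismatch
```

Note that this only tells you the two machines disagree, not which one is wrong or why; the "as you go" range checking the text contrasts it with has to be built into the algorithm itself.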
Mathematical projects like GIMPS are probably always going to have advantages for this sort of thing.
For instance, if you use FFT with its floating point math for rapidly performing an integer multiply, you are automatically dealing very directly with floating point rounding error. You simply have to account for it.
But, if you're simply bashing matrices (using FFT or not) on ordinary, "live" data whose values may be relatively unbounded (that's why you're doing it) then what you can do is program something which achieves floating point convergence only. You then rely on double checks and the fact that your algorithm properly converges (this convergence is not a DC exclusive - it must be done for any serious floating point based project).
So, the short answer is, the ability of the client to check "as it goes" as opposed to at the bitter end of a redundant calculation varies by the client and the opportunities the algorithms involved create.
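A convergence-based check of the kind described might look like this sketch: a plain Jacobi iteration that stops once successive iterates agree within a tolerance, and fails loudly if they never do. The solver, tolerance, and iteration cap are illustrative assumptions, not any particular project's code:

```python
import numpy as np

def jacobi_solve(A, b, tol=1e-10, max_iter=10_000):
    """Solve Ax = b by Jacobi iteration (converges for diagonally dominant A),
    relying on floating-point convergence rather than exact arithmetic."""
    x = np.zeros_like(b, dtype=float)
    D = np.diag(A)                    # diagonal entries
    R = A - np.diagflat(D)            # off-diagonal remainder
    for _ in range(max_iter):
        x_new = (b - R @ x) / D
        if np.max(np.abs(x_new - x)) < tol:   # converged within tolerance
            return x_new
        x = x_new
    raise RuntimeError("no convergence: suspect the data or the hardware")
```

Here the algorithm's own convergence is the sanity check: on a machine producing sporadic wrong bits, the iterates may stall or wander instead of settling, which is exactly the "properly converges" property the text says any serious floating point project must rely on.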
In addition, the certainty that a bad answer means a bad overclock varies just as much. If a project detects an outlier answer and discards it, whether to withhold credit comes down to how certain one can be that bad hardware is to blame, as opposed to possibly good hardware that produced an out-of-range answer because of an occasional floating point hiccup at an unfavorable moment. We are, after all, talking about fairly subtle errors in any event. If the error were really gross, the calculation would presumably just crash rather than run for hours or days and then turn up a wrong answer.
However, if a GIMPS-type stress test finds something, the fact that the hardware has successfully POSTed or booted is not enough to dismiss it. A lot of code, boot code included, is quite capable of overlooking errors of many kinds: a failure to boot leads to expensive service calls, so firmware is written to limp to the OS main screen if at all possible and let the full OS deal with the problem. So, whether overclockers know it or not, passing POST and even booting is not all that impressive.
How long should I run the torture test?
It is recommended to run it for somewhere between 6 and 24 hours. The test has been known to fail only after several hours, and in some cases several weeks, of operation. In most cases, though, it will fail within a few minutes on a flaky machine.
Prime95 reports errors during the torture test, but other programs do not?
Yes, you've reached the point where your machine has been pushed beyond its limits. Step the overclock back a few notches and run the torture test again until your machine is 100% stable, or decide to live with a machine that could have problems in rare circumstances; in that case, do not run any distributed computing project.
Should I lower FSB for a higher multiplier?
Prime95 and Seti clients will respond better to a higher FSB, because memory bandwidth increases with the FSB overclock rather than with a higher multiplier alone. You would have to test this yourself: you already run the Prime95 benchmark, and for Seti you could try the reference WU available on the TLC site. Get things running stably, then drop the FSB by 1 MHz to give yourself a better safety margin.
What to do if a problem is found during stress testing?
The exact cause of a hardware problem can be very hard to find.
If you are not overclocking, the most likely cause is an overheating CPU or memory DIMMs that are not quite up to spec. Another possibility is that you need a better power supply. Try running a hardware monitoring utility such as Motherboard Monitor, and browse hardware forums to see if your CPU is running too hot. If so, make sure the heat sink is properly attached, fans are operational, and air flow inside the case is good. To isolate memory problems, try swapping memory DIMMs with a co-worker's or friend's machine. If the errors go away, you can be fairly confident that memory was the cause of the trouble. A power supply problem can often be identified by a dangerous drop in the voltages when Prime95 starts running. Once again, the overclocker forums are a good resource for what voltages are acceptable.
If you are overclocking, try increasing the core voltage, reducing the CPU speed, reducing the front side bus speed, or relaxing the memory timings (CAS latency). Also try asking for help in one of the forums - they may have other ideas to try.
What type of work should I use to burn-in a box?
Use the Torture test.
Where can I find a list of OC tools?
You can find one here.
Why is my machine failing the self test?
How much are you overclocked? Have you run SuperPi to a decent level to make sure the FPU is still sane? Did you switch from factoring to primality testing, or are you doing the same type of work? Put your box back at the default speed and see if that fixes it. Also run the Prime95 torture test for a long time, at least 24 hours.
Why is Prime95 good for testing an overclocked machine?
The GIMPS prime program is a very good stress test for the CPU, memory, caches, CPU cooling, and case cooling. The torture test runs continuously, comparing your computer's results to results that are known to be correct. Any mismatch and you've got a problem! Note that the torture test sometimes reads from and writes to disk but cannot be considered a stress test for hard drives.
You'll need other programs to stress video cards, the PCI bus, disk access, networking, and other important components. In addition, this is only one of several good programs that are freely available. Some people report finding problems only when running two or more stress test programs concurrently. You may need to raise Prime95's priority when running two stress test programs so that each gets about 50% of the CPU time.
Why should I stress test my overclocked machine?
Today's computers are not perfect. Even brand new systems from major manufacturers can have hidden flaws. If any of several key components such as CPU, memory, cooling, etc. are not up to spec, it can lead to incorrect calculations and/or unexplained system crashes.
Overclocking is the practice of trying to increase the speed of the CPU and memory in an effort to make a machine faster at little cost. Typically, overclocking involves pushing the machine to its limits and then backing off just a little bit.
For these reasons, both non-overclockers and overclockers need programs that test the stability of their computers. This is done by running programs that put a very heavy load on the computer. Though not originally designed for this purpose, the GIMPS prime program is one of a few programs that are excellent at stress testing a computer.