climateprediction.net home page
Regional HADAM3P Failures

old_user596405

Joined: 4 Oct 09
Posts: 73
Credit: 7,242,427
RAC: 0
Message 40620 - Posted: 8 Sep 2010, 7:26:53 UTC
Last modified: 8 Sep 2010, 7:54:08 UTC

Are we expecting random crashes with the new models - like FAMOUS negative thetas?

Got first failure last night. This one (11871074) crashed after 4 trickles (> 33%).

Failed on a slightly overclocked Q6600 / 4 GB / Win 7 64 system (probably with the best record for completions) that had already finished 4 assorted regional models.

It would have been interesting to see if a restore would have been successful, but I no longer bother taking backups for today's shorter model types. If any more fail, I will start backing up again just to find out.

Apart from anticipated FAMOUS thetas and SM3 iceballs, the only other failures experienced over past year or so would have been caused by occasional power supply problems.

Just curious.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 40621 - Posted: 8 Sep 2010, 10:58:33 UTC

These regional models don't generate the same THETA and PRESSURE errors as some FAMOUS and should almost always complete successfully. Unfortunately yours is the only model given out from that workunit so you can't see whether another computer with Intel and Windows has completed it. (Usually one can only compare directly with machines that have the same CPU type and OS.)

-161 appears to be a BOINC error. Here's what it means. And here's Signal 11. In your case I don't think those explanations help much; we only know that something went wrong. But it does look as if something happened.

The computer is processing pretty fast. 1.7 sec/TS for that model. My quad 6600 has two European models running alongside two FAMOUS; they're running at 2.51 sec/TS. If you do get more regional crashes I'd do the stability tests and, as you suggest, start taking backups. If you restore a backup because the task crashed and the restore gets past the failure point you do know for sure that the problem lies within the computer.
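
For anyone who wants to script the backup-and-restore test described above, here is a minimal sketch in Python. It assumes the BOINC client has been fully shut down before either function is called, and that you supply the location of your own BOINC data directory. The function names and paths are illustrative only and are not part of BOINC or CPDN.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_dir(data_dir, backup_root):
    """Copy data_dir into a new timestamped folder under backup_root.

    Assumption: the BOINC client is stopped, so no files change mid-copy.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    shutil.copytree(src, dest)  # raises if dest already exists
    return dest


def restore_boinc_dir(backup_dir, data_dir):
    """Replace data_dir with the contents of backup_dir.

    Assumption: the BOINC client is stopped. The current data_dir is deleted.
    """
    dst = Path(data_dir)
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(backup_dir, dst)
    return dst
```

Typical use would be to call backup_boinc_dir on your data directory every so often, then call restore_boinc_dir with the most recent copy after a crash and see whether the model gets past the point where it failed.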
old_user596405

Joined: 4 Oct 09
Posts: 73
Credit: 7,242,427
RAC: 0
Message 40622 - Posted: 8 Sep 2010, 13:54:30 UTC - in response to Message 40621.  
Last modified: 8 Sep 2010, 13:55:17 UTC


Many thanks for info. Ok, so we should not expect model issues with these regional ones. Thought so (else we would have been advised!).

As stated, this machine has a good track record. The OC level was well tuned and tested some time ago, but I suppose it is worth doing occasional checks. Even recently dusted out the system when replacing the PSU. Temps are fine - same as always. GPUGRID is running on a modest, cool, low power card and has been for many months without any hiccups.

Am puzzled about the 1.7 s/TS though. Including the failure, this machine has completed or is still running 8 other models. s/TS ranges from 1.86 to 2.26 (the latter a pair of PNW versions), as seen in the trickle pages. My other quad, running at the stock 2.4 GHz, has just started its first regional model. Running at 2.8 s/TS.

Back to backups then. Hate losing models. At least, unlike FAMOUS, it will be worth taking backups for regional models.

Thanks again.