Computer Erroring Out
Author | Message |
---|---|
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
My main computer, hostid 1109774, was crunching 3 very long tasks, HadCM3 Coupled Model: Task 12009797, Task 12009783, and Task 12007385. They all errored out. Furthermore, the computer is trashing all newly assigned models. I am out of town until January 3rd, so I cannot manage the computer until then. There is some strange stderr output on all 3 crashed WUs. Is there a way I can prevent the computer from downloading new tasks? I changed the computer preferences to not do work when the computer is idle, so hopefully that works. I can also try restricting the allowed network usage to a time when the server is offline. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
In the Projects tab of your manager: click on climateprediction.net, then click the No new tasks button. |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
"In the Projects tab of your manager:" I'm out of town, so I don't have physical or remote access to the machine. |
Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
You can change your climateprediction.net preferences to only download, for example, hadsm3mh (Mid-Holocene) models, which aren't being produced right now. If you have multiple computers and don't want those preferences in effect for all of them, you can set up that particular computer for a different venue (home, work, school) and just set preferences for that venue. |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
I changed the preferences; hopefully it works. The computer has not contacted the server since 22 Dec 2010 0:55:02 UTC, which is about 20 hours ago at the time of this posting. Can someone please tell me what the stderr means for the crashed work-units? I got a good 25 days of computation on them before they crashed. That sucks. |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I'll have a stab at what some of it means. The first two models you linked to show the following near the end of the stderr, and this seems to be what caused each crash:

Model crashed: POTTEM STOPPING - PRESSURE OUT OF RANGE

and

Model crashed: ERR IN FNZTOP - ITERATION HASN'T CONVERGED

There's nothing specific like that in the stderr of the third model, but something similar may still have happened.

POTTEM is some calculation of the temperatures in different layers of the ocean. The code for one model type (not ours) includes the lines:

DEFINE LOCAL VARIABLES POTTEM.44

I'd guess that your computer or your model has produced a value that's out of the allowed range (it may be physically impossible), so the model has to abort itself.

A Met Office model (I don't know which one) contains the code:

CLL======== FUNCTION fnztop =================================== FNZTOP.2

This is also concerned with ocean calculations, in this case the pressure. A calculation has to iterate, or repeat and repeat, until whatever the error is has been reduced to zero. But the code won't let the model keep repeating for ever in an eternal loop. After so many iterations or repeats the model self-aborts. Your model didn't manage to eliminate the error in the allowed number of repeats.

So there are calculation errors in two models. The question is:

1. Is this the fault of the model type or this particular model/workunit?
2. Or is it the fault of your computer?

You'll only know for sure that the models in these workunits are OK if other computers in the same WUs complete them. Nobody has yet. You would know for sure that these workunits are defective if other computers in the same WUs crash them with the same or similar calculation errors. I don't think that's happened yet. You need to think about possibility #2, because this appears to be the computer that was (still is?) overclocked and crashed a couple of earlier models with 'Maximum elapsed time exceeded'. |
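To illustrate the two kinds of self-abort described above (a value leaving its allowed range, and an iteration that runs out of repeats), here is a minimal Python sketch. It is not the model's code: the function name `solve_by_iteration`, the use of Newton's method, and the default tolerance, iteration cap, and valid range are all assumptions chosen for illustration.

```python
import math


def solve_by_iteration(f, df, x0, tol=1e-10, max_iter=50,
                       valid_range=(-1e6, 1e6)):
    """Find a root of f with Newton's method, aborting instead of looping forever.

    Hypothetical helper: the real model code and its limits are not shown
    in this thread. Returns (root, iterations_used).
    """
    x = x0
    for i in range(1, max_iter + 1):
        step = f(x) / df(x)
        x -= step
        # Analogue of "PRESSURE OUT OF RANGE": the value is no longer plausible.
        if not (valid_range[0] <= x <= valid_range[1]):
            raise ValueError(f"value out of range after {i} iterations: {x}")
        # Converged: the correction has shrunk below the tolerance.
        if abs(step) < tol:
            return x, i
    # Analogue of "ITERATION HASN'T CONVERGED": the cap on repeats was reached.
    raise RuntimeError(f"iteration hasn't converged after {max_iter} iterations")


if __name__ == "__main__":
    # Well-behaved case: converges to sqrt(2) in a handful of iterations.
    print(solve_by_iteration(lambda x: x * x - 2, lambda x: 2 * x, 1.0))

    # Cycles between 0 and 1 forever, so the iteration cap triggers.
    try:
        solve_by_iteration(lambda x: x**3 - 2 * x + 2,
                           lambda x: 3 * x * x - 2, 0.0)
    except RuntimeError as err:
        print("aborted:", err)

    # Diverges from this starting point, so the range check triggers.
    try:
        solve_by_iteration(math.atan, lambda x: 1 / (1 + x * x), 2.0)
    except ValueError as err:
        print("aborted:", err)
```

In a sketch like this, a single corrupted intermediate value (for example from an unstable overclock) can be enough to push an otherwise healthy iteration into either abort path, which is why the same error can point at the workunit or at the hardware.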
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The name of this model type is 'HadCM3 Coupled Model Experiment Optimised File I/O v6.04'. It's the same basic model type as the BBC model. Too many of them crashed with calculation errors. Tolu optimised the model type by slowing them down by about 20%. The slower models hardly ever produced calculation errors. I'm not saying that processing speed is necessarily the significant factor in the case of your crashed models, but it could be. Mind, I don't think I've ever seen these particular calculation error messages before. |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
For my Task 12009783, from Workunit 6963025, there is a paired fast Intel Xeon 5160 @ 3.00GHz on Linux that is not far behind in the calculation. I guess I'll watch and see if it has any problems. On the other hand, I do think it might just be a problem with my overclocked computer. I did dial down the overclock a bit, and the PC ran fine for over 25 days with no problems before this. |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
So I was finally able to check my computer. These are the boinc messages:
|
Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
Two of the hadcm3's crashed after the 60th year/trickle and the other after the 57th, all at the same time, as you said. Since then, that PC has not been able to correctly download any models. My guess is a hardware problem. It could be a hard disk issue, a memory issue, or a processor issue. I would test the system with Prime95 for a while, and with any other hardware diagnostic software you can (memtest86+, your hard drive manufacturer's diagnostic tests). Until those tests come back clean, don't try to download any more BOINC tasks. |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
That's very interesting because when I accessed the computer it seemed that it was running fine. The up-time was 18 days (I rebooted before I left on winter vacation). Boinc was running and was responsive. It was interesting that it did not contact the server after the 22nd. In the message log it said that it was just running CPU benchmarks every once in a while. |
Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
If it's just the software, you could try to do a reset on the climateprediction.net project. That should flush any troublesome files and download a new batch. Of course if it's a hardware error, eventually something similar will happen again. |
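For anyone who does have shell access to the machine, the same project operations can be issued from the command line with `boinccmd --project URL operation`. The Python sketch below only wraps that call. The project URL constant is an assumption (check the exact master URL shown for the project in your manager), the helper names are hypothetical, and `boinccmd` is assumed to be on the PATH and able to authenticate with the local client.

```python
import subprocess

# Assumed master URL for the project; verify it against your own client.
CPDN_URL = "http://climateprediction.net/"

# Project operations accepted by `boinccmd --project URL <operation>`.
ALLOWED_OPS = {"nomorework", "allowmorework", "reset",
               "update", "suspend", "resume"}


def boinccmd_args(operation, project_url=CPDN_URL):
    """Build the argument list for a boinccmd project operation."""
    if operation not in ALLOWED_OPS:
        raise ValueError(f"unsupported project operation: {operation}")
    return ["boinccmd", "--project", project_url, operation]


def run_project_op(operation, project_url=CPDN_URL):
    """Run the operation against the local BOINC client.

    Raises subprocess.CalledProcessError if boinccmd reports failure.
    """
    return subprocess.run(boinccmd_args(operation, project_url), check=True)
```

Calling `run_project_op("nomorework")` corresponds to the No new tasks button, and `run_project_op("reset")` to the project reset suggested here. Note that a reset discards any work in progress for the project.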
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
I am overclocking the computer, but the problem is that if it is a hardware error related to this, it manifests itself very rarely. The models made it to year 60, which is a lot of calculation to go without error. I imagine this is very difficult, if not impossible, for me to track down. |
Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275 |
Right now, as it's not downloading any new work, it seems like something has gotten corrupted in your boincdata directory. This is probably because of a hardware error, and quite likely a result of overclocking. If you can't run Prime95 for several hours without an error, cpdn work will crash; it is that simple. Prime95 is more taxing on the system than cpdn and should show an error if there is a CPU or memory problem. If it's hard disk corruption, for whatever reason, Prime95 likely wouldn't show that, but disk corruption could also explain the errors listed on your task result webpages. |
Send message Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0 |
I ran checkdisk on the hard drive with no errors (it's a 2 TB RAID 0 array). I am running Prime95 in 'blend' mode right now. I will run it for 48 hours. |
©2024 cpdn.org