Computer Erroring Out

Author	Message
NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41335 - Posted: 22 Dec 2010, 11:56:23 UTC Last modified: 22 Dec 2010, 12:00:05 UTC
My main computer, hostid 1109774, was crunching 3 very long HadCM3 Coupled Model tasks:
Task 12009797
Task 12009783
Task 12007385
They all errored out. Furthermore, the computer is trashing all newly assigned models. I am out of town until January third, so I cannot manage the computer until then. There is some strange stderr output on all 3 crashed WUs.
Is there a way I can prevent the computer from downloading new tasks? I changed the computer preferences to not do work when the computer is idle, so hopefully that works. I can also try to change the allowed network usage hours to a time when the server is offline.

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 41336 - Posted: 22 Dec 2010, 12:13:02 UTC - in response to Message 41335.
In the Projects tab of your manager:
Click on climateprediction.net
Click the No new tasks button

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41337 - Posted: 22 Dec 2010, 12:35:09 UTC - in response to Message 41336.
Quote: "In the Projects tab of your manager: Click on climateprediction.net. Click the No new tasks button."
I'm out of town, so I don't have physical or remote access to the machine.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275	Message 41339 - Posted: 22 Dec 2010, 14:30:30 UTC
You can change your climateprediction.net preferences to download only, for example, hadsm3mh (Mid-Holocene) models, which aren't being produced right now. If you have multiple computers and don't want those preferences in effect for all of them, you can assign that particular computer to a different venue (home, work, school) and set the preferences for that venue only.

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41342 - Posted: 22 Dec 2010, 20:40:41 UTC
I changed the preferences, hopefully it works. The computer has not contacted the server since 22 Dec 2010 0:55:02 UTC, which is about 20 hours ago at the time of this posting.
Can someone please tell me what the stderr means for the crashed work-units? I got a good 25 days of computation on them before they crashed. That sucks.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 41343 - Posted: 22 Dec 2010, 23:44:09 UTC Last modified: 22 Dec 2010, 23:56:54 UTC
I'll have a stab at what some of it means. The first two models you linked to show the following near the end of the stderr, and this seems to be what caused each crash:

Model crashed: POTTEM STOPPING - PRESSURE OUT OF RANGE
Model crashed: POTTEM STOPPING - PRESSURE OUT OF RANGE
Model crashed: ERR IN FNZTOP - ITERATION HASN'T CONVERGED

and

Model crashed: ERR IN FNZTOP - ITERATION HASN'T CONVERGED
CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3504, iMonCtr=1
Model crash detected, will try to restart...
Model crashed: POTTEM STOPPING - PRESSURE OUT OF RANGE

There's nothing specific like that in the stderr of the third model, but something similar may still have happened.

POTTEM is some calculation of the temperatures in different layers of the ocean. The code for one model type (not ours) includes the lines:

      DEFINE LOCAL VARIABLES
      REAL
     & P      ! PRESSURE
     & ,T     ! TEMPERATURE
     & ,TB    ! TEMPORARY VARIABLE
     & ,TA    ! TEMPORARY VARIABLE
     & ,TEST  ! ERROR TESTER
     & ,DP    ! PRESSURE STEP

      REAL ATG   !FUNCTION FOR ADIABATIC LAPSE RATE
      EXTERNAL ATG
C
      IF (P0.LT.0.0E0.OR.P0.GT.20000.0E0
     & .OR.P1.LT.0.0E0.OR.P1.GT.20000.0E0) THEN

      WRITE(6,*)'SUBROUTINE POTTEM STOPPING - PRESSURE OUT OF RANGE'
      WRITE(6,*)'PRESSURES P0 AND P1 = ',P0,P1
      WRITE(6,*)'ALLOWED RANGE IS 0.0 - 20,000'

I'd guess that your computer or your model has produced a value that's out of the allowed range (it may be physically impossible), so the model has to abort itself.

A Met Office model (I don't know which one) contains the code:

CLL======== FUNCTION fnztop ===================================
CLL FUNCTION TO CALCULATE PRESSURE IN DECIBARS FROM DEPTH IN METRES
CLL USING AN ITERATIVE INVERSE OF SAUNDERS ALGORITHM (FUNCTION
CLL fnztop). ITERATES UNTIL THE ERROR IS ZERO, A LIMIT CYCLE IS
CLL DETECTED OR 'MLOOP' ITERATIONS REACHED.
CLL ERRROR EXIT IF ERROR > EPS.

This is also concerned with ocean calculations, in this case the pressure. A calculation has to iterate, or repeat and repeat, until whatever the error is has been reduced to zero. But the code won't let the model keep repeating for ever in an eternal loop. After so many iterations or repeats the model self-aborts. Your model didn't manage to eliminate the error in the allowed number of repeats.

So there are calculation errors in two models. The question is:
1. Is this the fault of the model type or of this particular model/workunit?
2. Or is it the fault of your computer?

You'll only know for sure that the models in these workunits are OK if other computers in the same WUs complete them. Nobody has yet. You would know for sure that these workunits are defective if other computers in the same WUs crash them with the same or similar calculation errors. I don't think that's happened yet.

You need to think about possibility #2 because this appears to be the computer that was (still is?) overclocked and crashed a couple of earlier models with 'Maximum elapsed time exceeded'.
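
The two stderr messages discussed in the post above look like two common guard patterns: a range check on inputs, and a cap on the number of iterations of a solver. The following is only a rough sketch of those patterns, not the Met Office code. The function names, the toy depth/pressure relation, and the values of `EPS` and `MLOOP` are all assumptions made for illustration; only the 0 to 20,000 range and the message texts come from the thread.

```python
"""Illustrative guards only; this is not the Met Office model code."""

P_MIN, P_MAX = 0.0, 20000.0   # allowed range quoted in the POTTEM snippet
EPS = 1e-6                    # hypothetical convergence tolerance
MLOOP = 50                    # hypothetical iteration cap


class ModelCrash(RuntimeError):
    """Raised when a guard trips, mimicking a model self-abort."""


def check_pressures(p0, p1):
    """Range check in the spirit of the POTTEM test on P0 and P1."""
    for p in (p0, p1):
        # "not (lo <= p <= hi)" is also true for NaN, so NaN is rejected too.
        if not (P_MIN <= p <= P_MAX):
            raise ModelCrash("POTTEM STOPPING - PRESSURE OUT OF RANGE")


def depth_from_pressure(p):
    """Toy forward relation (hypothetical): depth in metres from decibars."""
    return p * (0.99 - 2.0e-6 * p)


def pressure_from_depth(z):
    """Invert the toy relation by fixed-point iteration with a hard cap."""
    p = z  # first guess: roughly 1 decibar per metre
    for _ in range(MLOOP):
        error = z - depth_from_pressure(p)
        if abs(error) <= EPS:
            return p
        p += error
    # The loop ran MLOOP times without meeting the tolerance: give up.
    raise ModelCrash("ERR IN FNZTOP - ITERATION HASN'T CONVERGED")
```

With sane inputs the loop finishes in a handful of steps. A single corrupted value (for example a NaN produced by a hardware fault) never satisfies the tolerance test, so the cap is reached and the sketch aborts, which is the kind of behaviour described for the crashed models.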

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 41344 - Posted: 23 Dec 2010, 0:05:42 UTC
The name of this model type is 'HadCM3 Coupled Model Experiment Optimised File I/O v6.04'. It's the same basic model type as the BBC model. Too many of the BBC models crashed with calculation errors, so Tolu optimised the model type by slowing the models down by about 20%. The slower models hardly ever produced calculation errors.
I'm not saying that processing speed is necessarily the significant factor in the case of your crashed models, but it could be. Mind, I don't think I've ever seen these particular calculation error messages before.

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41346 - Posted: 23 Dec 2010, 18:06:51 UTC Last modified: 23 Dec 2010, 18:07:34 UTC
For my Task 12009783, from Workunit 6963025, there is a paired fast Intel Xeon 5160 @ 3.00GHz on Linux that is not far behind in its calculation. I guess I'll watch and see if it has any problems.
On the other hand, I do think it might just be a problem with my overclocked computer. I did dial down the overclock a bit, and the PC ran fine for over 25 days with no problems before this.

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41424 - Posted: 3 Jan 2011, 22:59:55 UTC Last modified: 3 Jan 2011, 23:00:19 UTC
So I was finally able to check my computer. These are the BOINC messages:

21-Dec-2010 12:46:50 [climateprediction.net] Requesting new tasks for GPU
21-Dec-2010 12:46:53 [climateprediction.net] Scheduler request completed: got 0 new tasks
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work sent
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work is available for UK Met Office HadSM3 Slab Model
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work available for the applications you have selected. Please check your settings on the web site.
21-Dec-2010 12:46:54 [climateprediction.net] Started upload of hadcm3igeo_w2kx_2000_80_06759712_0_6.zip
21-Dec-2010 12:49:43 [climateprediction.net] Finished upload of hadcm3igeo_w2kx_2000_80_06759712_0_6.zip
21-Dec-2010 13:29:59 [climateprediction.net] Computation for task hadcm3igeo_w2l0_2000_80_06759709_1 finished
21-Dec-2010 13:29:59 [climateprediction.net] Output file hadcm3igeo_w2l0_2000_80_06759709_1_7.zip for task hadcm3igeo_w2l0_2000_80_06759709_1 absent
21-Dec-2010 13:29:59 [climateprediction.net] Output file hadcm3igeo_w2l0_2000_80_06759709_1_8.zip for task hadcm3igeo_w2l0_2000_80_06759709_1 absent
21-Dec-2010 13:30:04 [climateprediction.net] Computation for task hadcm3igeo_w2kx_2000_80_06759712_0 finished
21-Dec-2010 13:30:04 [climateprediction.net] Output file hadcm3igeo_w2kx_2000_80_06759712_0_7.zip for task hadcm3igeo_w2kx_2000_80_06759712_0 absent
21-Dec-2010 13:30:04 [climateprediction.net] Output file hadcm3igeo_w2kx_2000_80_06759712_0_8.zip for task hadcm3igeo_w2kx_2000_80_06759712_0 absent
21-Dec-2010 13:30:19 [climateprediction.net] Computation for task hadcm3igeo_w2yc_2000_80_06759229_4 finished
21-Dec-2010 13:30:19 [climateprediction.net] Output file hadcm3igeo_w2yc_2000_80_06759229_4_6.zip for task hadcm3igeo_w2yc_2000_80_06759229_4 absent
21-Dec-2010 13:30:19 [climateprediction.net] Output file hadcm3igeo_w2yc_2000_80_06759229_4_7.zip for task hadcm3igeo_w2yc_2000_80_06759229_4 absent
21-Dec-2010 13:30:19 [climateprediction.net] Output file hadcm3igeo_w2yc_2000_80_06759229_4_8.zip for task hadcm3igeo_w2yc_2000_80_06759229_4 absent
21-Dec-2010 13:31:21 [climateprediction.net] Sending scheduler request: To fetch work.
21-Dec-2010 13:31:21 [climateprediction.net] Reporting 3 completed tasks, requesting new tasks for CPU
21-Dec-2010 13:31:29 [climateprediction.net] Scheduler request completed: got 3 new tasks
21-Dec-2010 13:31:31 [climateprediction.net] Started download of hadam3p_pnw_6.08_windows_intelx86.exe
21-Dec-2010 13:31:31 [climateprediction.net] Started download of hadam3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:31:37 [climateprediction.net] Finished download of hadam3p_pnw_6.08_windows_intelx86.exe
21-Dec-2010 13:31:37 [climateprediction.net] Started download of hadam3p_pnw_graphics_6.08_windows_intelx86.exe
21-Dec-2010 13:31:41 [climateprediction.net] Finished download of hadam3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:31:41 [climateprediction.net] Started download of hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:47 [climateprediction.net] Finished download of hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:47 [climateprediction.net] Started download of hadrm3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:31:48 [climateprediction.net] [error] File hadam3p_pnw_se_6.08_windows_intelx86.zip has wrong size: expected 987053, got 0
21-Dec-2010 13:31:48 [climateprediction.net] [error] Checksum or signature error for hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:52 [climateprediction.net] Finished download of hadam3p_pnw_graphics_6.08_windows_intelx86.exe
21-Dec-2010 13:31:52 [climateprediction.net] Started download of hadam3p_pnw_data_6.08_windows_intelx86.zip
21-Dec-2010 13:31:53 [climateprediction.net] [error] File hadam3p_pnw_graphics_6.08_windows_intelx86.exe has wrong size: expected 2098176, got 0
21-Dec-2010 13:31:53 [climateprediction.net] [error] Checksum or signature error for hadam3p_pnw_graphics_6.08_windows_intelx86.exe
21-Dec-2010 13:31:54 [climateprediction.net] Finished download of hadam3p_pnw_data_6.08_windows_intelx86.zip
21-Dec-2010 13:31:54 [climateprediction.net] Started download of hadam3p_eu_xyo1_1960_1_007050137.zip
21-Dec-2010 13:31:55 [climateprediction.net] [error] File hadam3p_pnw_data_6.08_windows_intelx86.zip has wrong size: expected 75116, got 0
21-Dec-2010 13:31:55 [climateprediction.net] [error] Checksum or signature error for hadam3p_pnw_data_6.08_windows_intelx86.zip
21-Dec-2010 13:31:57 [climateprediction.net] Finished download of hadam3p_eu_xyo1_1960_1_007050137.zip
21-Dec-2010 13:31:57 [climateprediction.net] Started download of o3_A2_1959_2010_N96_f.anc.gz
21-Dec-2010 13:31:59 [climateprediction.net] [error] File hadam3p_eu_xyo1_1960_1_007050137.zip has wrong size: expected 12637, got 0
21-Dec-2010 13:31:59 [climateprediction.net] [error] Checksum or signature error for hadam3p_eu_xyo1_1960_1_007050137.zip
21-Dec-2010 13:32:00 [climateprediction.net] Finished download of hadrm3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:32:00 [climateprediction.net] Started download of ic19611020_10_N96.gz
21-Dec-2010 13:32:10 [climateprediction.net] Finished download of ic19611020_10_N96.gz
21-Dec-2010 13:32:10 [climateprediction.net] Started download of xaclfa.start.0000.gz
21-Dec-2010 13:32:10 [climateprediction.net] [error] File ic19611020_10_N96.gz has wrong size: expected 1314394, got 0
21-Dec-2010 13:32:10 [climateprediction.net] [error] Checksum or signature error for ic19611020_10_N96.gz
21-Dec-2010 13:32:25 [climateprediction.net] Finished download of o3_A2_1959_2010_N96_f.anc.gz
21-Dec-2010 13:32:25 [climateprediction.net] Started download of oxi.addfa.gz
21-Dec-2010 13:33:04 [climateprediction.net] Sending scheduler request: To report completed tasks.
21-Dec-2010 13:33:04 [climateprediction.net] Reporting 2 completed tasks, not requesting new tasks
21-Dec-2010 13:33:07 [climateprediction.net] Scheduler request completed
21-Dec-2010 13:33:09 [climateprediction.net] [error] File hadam3p_eu_6.08_windows_intelx86.exe has wrong size: expected 780288, got 0
21-Dec-2010 13:33:12 [climateprediction.net] Sending scheduler request: To fetch work.
21-Dec-2010 13:33:12 [climateprediction.net] Not reporting or requesting tasks
21-Dec-2010 13:33:13 [climateprediction.net] Scheduler request completed
21-Dec-2010 13:33:19 [climateprediction.net] Sending scheduler request: To fetch work.
21-Dec-2010 13:33:19 [climateprediction.net] Reporting 1 completed tasks, requesting new tasks for CPU
21-Dec-2010 13:33:21 [climateprediction.net] Scheduler request completed: got 0 new tasks
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work sent
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for UK Met Office HadSM3 Slab Model
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for UK Met Office HADAM3P European Region
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for UK Met Office HADAM3P Pacific North West
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: (reached daily quota of 3 tasks)
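
A log like the one in the post above is easier to read once the failed downloads are pulled out on their own. The sketch below assumes only the "[error] File ... has wrong size: expected N, got M" wording that appears in the log; the function name `failed_downloads` and the short `sample` string are made up for illustration.

```python
import re

# Matches the "[error] File NAME has wrong size: expected N, got M" entries.
WRONG_SIZE = re.compile(
    r"\[error\] File (?P<name>\S+) has wrong size: "
    r"expected (?P<expected>\d+), got (?P<got>\d+)"
)


def failed_downloads(log_text):
    """Return (filename, expected_bytes, received_bytes) per size mismatch."""
    return [
        (m.group("name"), int(m.group("expected")), int(m.group("got")))
        for m in WRONG_SIZE.finditer(log_text)
    ]


# Three entries copied from the log, used here as a small sample.
sample = """\
21-Dec-2010 13:31:47 [climateprediction.net] Finished download of hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:48 [climateprediction.net] [error] File hadam3p_pnw_se_6.08_windows_intelx86.zip has wrong size: expected 987053, got 0
21-Dec-2010 13:32:10 [climateprediction.net] [error] File ic19611020_10_N96.gz has wrong size: expected 1314394, got 0
"""

for name, expected, got in failed_downloads(sample):
    print(f"{name}: expected {expected} bytes, got {got}")
```

Because the pattern is searched with `finditer`, it does not matter whether the log is pasted one entry per line or as a single run of text. In this log every mismatch reports 0 bytes received, even for files whose download was just reported as finished.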

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275	Message 41425 - Posted: 4 Jan 2011, 0:14:04 UTC
Two of the hadcm3's crashed after the 60th year/trickle and the other after the 57th, all at the same time, as you said. Since then, that PC has not been able to download any models correctly. My guess is a hardware problem. It could be a hard disk issue, a memory issue, or a processor issue. I would test the system with Prime95 for a while, and with any hardware diagnostic software you can run (memtest86+, your hard drive manufacturer's diagnostic tests). Until those tests come back clean, don't try to download any more BOINC tasks.

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41426 - Posted: 4 Jan 2011, 1:27:04 UTC Last modified: 4 Jan 2011, 1:27:16 UTC
That's very interesting, because when I accessed the computer it seemed to be running fine. The uptime was 18 days (I rebooted before I left on winter vacation). BOINC was running and was responsive. It was interesting that it did not contact the server after the 22nd. According to the message log, it was just running CPU benchmarks every once in a while.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275	Message 41427 - Posted: 4 Jan 2011, 3:17:26 UTC
If it's just the software, you could try resetting the climateprediction.net project. That should flush any troublesome files and download a new batch. Of course, if it's a hardware error, something similar will eventually happen again.
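
For anyone who prefers the command line to the Manager, BOINC's `boinccmd` tool has project operations that correspond to the two actions suggested in this thread: `nomorework` (No new tasks) and `reset`. The sketch below only builds the argument lists. The project URL is an assumption and should be replaced with the one shown for the project in your own client; `boinccmd_args` is a hypothetical helper name, and the set of operations listed is a subset.

```python
import shlex

# Assumption: replace with the project URL shown in your own BOINC client.
PROJECT_URL = "http://climateprediction.net/"

# A subset of the operations accepted by `boinccmd --project URL <op>`.
PROJECT_OPS = {"reset", "update", "suspend", "resume",
               "nomorework", "allowmorework"}


def boinccmd_args(op, url=PROJECT_URL):
    """Build the argument list for one boinccmd project operation."""
    if op not in PROJECT_OPS:
        raise ValueError(f"unsupported project operation: {op!r}")
    return ["boinccmd", "--project", url, op]


# Print the commands rather than running them.
print(shlex.join(boinccmd_args("nomorework")))  # same effect as "No new tasks"
print(shlex.join(boinccmd_args("reset")))       # same effect as resetting the project
```

To actually run one of these, the list can be passed to `subprocess.run(args, check=True)` on the machine where the BOINC client is running. That still requires access to the machine, so it would not have helped while the poster was away.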

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41428 - Posted: 4 Jan 2011, 6:41:17 UTC - in response to Message 41427.
I am overclocking the computer, but the problem is that if this is a hardware error related to the overclock, it manifests itself very rarely. The models made it to year 60, which is a lot of calculation to go without error. I imagine this will be very difficult, if not impossible, for me to track down.

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275	Message 41430 - Posted: 4 Jan 2011, 13:58:18 UTC - in response to Message 41428.
Right now, as it's not downloading any new work, it seems like something has become corrupted in your boincdata directory. This is probably because of a hardware error, and quite likely a result of overclocking. If you can't run Prime95 for several hours without an error, cpdn work will crash; it is that simple. Prime95 is more taxing on the system than cpdn and should show an error if there is a CPU or memory problem. Hard disk corruption, whatever its cause, likely wouldn't show up in Prime95, but it could also explain the errors listed on your task result webpages.

NewtonianRefractor Joined: 22 May 08 Posts: 49 Credit: 2,335,997 RAC: 0	Message 41431 - Posted: 4 Jan 2011, 16:47:59 UTC - in response to Message 41430.
I ran checkdisk on the hard drive and it found no errors (it's a 2 TB RAID 0 array). I am running Prime95 in 'blend' mode right now. I will run it for 48 hours.