Computer Erroring Out

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41335 - Posted: 22 Dec 2010, 11:56:23 UTC
Last modified: 22 Dec 2010, 12:00:05 UTC

My main computer, hostid 1109774, was crunching 3 very long tasks, all HadCM3 Coupled Model.

Task 12009797
Task 12009783
Task 12007385

They all errored out. Furthermore, the computer is trashing all newly assigned models.

I am out of town until January third, so I cannot manage the computer until then.

There is some strange stderr output on all 3 crashed WUs.

Is there a way I can prevent the computer from downloading new tasks? I changed the computer preferences to not do work when the computer is idle, so hopefully that works. I can also try changing the allowed network usage hours to a time when the server is offline.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41336 - Posted: 22 Dec 2010, 12:13:02 UTC - in response to Message 41335.  

In the Projects tab of your manager:
Click on climateprediction.net
Click the No new tasks button
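
If the Manager isn't handy but a command prompt on the machine is, the same setting can probably be applied with BOINC's command-line tool, boinccmd, through its `nomorework` project operation. A minimal Python sketch of that idea follows. The project URL and the helper names are assumptions made for illustration, so check the URL your own client shows for the project before relying on it.

```python
import subprocess

# Assumption: use the project URL exactly as your BOINC client lists it.
PROJECT_URL = "http://climateprediction.net/"


def no_new_tasks_command(project_url=PROJECT_URL):
    """Build the boinccmd argument list that sets 'No new tasks'."""
    return ["boinccmd", "--project", project_url, "nomorework"]


def set_no_new_tasks(project_url=PROJECT_URL):
    """Run boinccmd on the local machine (hypothetical helper).

    Needs boinccmd on the PATH and a running BOINC client.
    """
    subprocess.run(no_new_tasks_command(project_url), check=True)
```

Calling `set_no_new_tasks()` on the affected machine should have the same effect as the button; swapping `nomorework` for `allowmorework` undoes it.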


NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41337 - Posted: 22 Dec 2010, 12:35:09 UTC - in response to Message 41336.  

In the Projects tab of your manager:
Click on climateprediction.net
Click the No new tasks button



I'm out of town, I don't have physical or remote access to the machine.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2184
Credit: 64,822,615
RAC: 5,275
Message 41339 - Posted: 22 Dec 2010, 14:30:30 UTC

You can change your climateprediction.net preferences to only download, for example, hadsm3mh (Mid-Holocene) models, which aren't being produced right now. If you have multiple computers and don't want those preferences in effect for them all, you can assign that particular computer to a different venue (home, work, school) and set the preferences for that venue only.

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41342 - Posted: 22 Dec 2010, 20:40:41 UTC

I changed the preferences, hopefully it works. The computer has not contacted the server since 22 Dec 2010 0:55:02 UTC, which is about 20 hours ago at the time of this posting.

Can someone please tell me what the stderr means for the crashed work-units? I got a good 25 days of computation on them before they crashed. That sucks.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41343 - Posted: 22 Dec 2010, 23:44:09 UTC
Last modified: 22 Dec 2010, 23:56:54 UTC

I'll have a stab at what some of it means. The first two models you linked to show the following near the end of the stderr and this seems to be what caused each crash:

Model crashed: POTTEM STOPPING - PRESSURE OUT OF RANGE
Model crashed: POTTEM STOPPING - PRESSURE OUT OF RANGE
Model crashed: ERR IN FNZTOP - ITERATION HASN'T CONVERGED
and
Model crashed: ERR IN FNZTOP - ITERATION HASN'T CONVERGED
CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3504, iMonCtr=1
Model crash detected, will try to restart...
Model crashed: POTTEM STOPPING - PRESSURE OUT OF RANGE

There's nothing specific like that in the stderr of the third model but something similar may still have happened.

POTTEM is some calculation of the temperatures in different layers of the ocean. The code for one model type (not ours) includes the lines:

      DEFINE LOCAL VARIABLES
      REAL
     &  P      ! PRESSURE
     & ,T      ! TEMPERATURE
     & ,TB     ! TEMPORARY VARIABLE
     & ,TA     ! TEMPORARY VARIABLE
     & ,TEST   ! ERROR TESTER
     & ,DP     ! PRESSURE STEP

      REAL ATG ! FUNCTION FOR ADIABATIC LAPSE RATE
      EXTERNAL ATG
C

      IF (P0.LT.0.0E0.OR.P0.GT.20000.0E0
     &    .OR.P1.LT.0.0E0.OR.P1.GT.20000.0E0) THEN

        WRITE(6,*)'SUBROUTINE POTTEM STOPPING - PRESSURE OUT OF RANGE'
        WRITE(6,*)'PRESSURES P0 AND P1 = ',P0,P1
        WRITE(6,*)'ALLOWED RANGE IS 0.0 - 20,000'

I'd guess that your computer or your model has produced a value that's out of the allowed range (it may be physically impossible) so the model has to abort itself.
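
To make the check in that Fortran fragment a bit more concrete, here is a rough Python sketch of the same range test. Only the limits (0.0 and 20,000) and the message text come from the quoted code; the function and constant names are made up for illustration, and this is not the model's actual routine.

```python
# Limits taken from the quoted WRITE statement: "ALLOWED RANGE IS 0.0 - 20,000".
P_MIN = 0.0
P_MAX = 20000.0


def check_pressures(p0, p1):
    """Raise ValueError if either pressure is outside the allowed range.

    Hypothetical stand-in for the IF test in the quoted POTTEM fragment.
    """
    if p0 < P_MIN or p0 > P_MAX or p1 < P_MIN or p1 > P_MAX:
        raise ValueError(
            "POTTEM STOPPING - PRESSURE OUT OF RANGE: "
            f"P0 = {p0}, P1 = {p1}, allowed range is {P_MIN} - {P_MAX}"
        )


check_pressures(150.0, 4000.0)  # both values in range, so nothing happens
```

A single bad value fed in from an earlier step (for instance a negative pressure) is enough to trip the test, which fits with the model stopping itself rather than carrying on with nonsense numbers.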

A Met Office model (I don't know which one) contains the code:

CLL======== FUNCTION fnztop ===================================

CLL FUNCTION TO CALCULATE PRESSURE IN DECIBARS FROM DEPTH IN METRES
CLL USING AN ITERATIVE INVERSE OF SAUNDERS ALGORITHM (FUNCTION
CLL fnztop). ITERATES UNTIL THE ERROR IS ZERO, A LIMIT CYCLE IS
CLL DETECTED OR 'MLOOP' ITERATIONS REACHED.
CLL ERROR EXIT IF ERROR > EPS.

This is also concerned with ocean calculations, in this case the pressure. A calculation has to iterate or repeat and repeat until whatever the error is has been reduced to zero. But the code won't let the model keep repeating for ever in an eternal loop. After so many iterations or repeats the model self-aborts. Your model didn't manage to eliminate the error in the allowed number of repeats.
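
The pattern described in those comments (iterate until the error is zero, a limit cycle is detected, or MLOOP iterations are reached, then exit with an error if the error is still above EPS) can be sketched roughly as below. This is only an illustration: the depth formula is a toy relation invented for the example, not Saunders' algorithm, and the default values for eps and mloop are assumptions.

```python
def depth_from_pressure(p_dbar):
    """Toy forward relation (an assumption, NOT Saunders' real formula)."""
    return 0.99 * p_dbar - 2.2e-6 * p_dbar ** 2


def pressure_from_depth(z_m, eps=1e-6, mloop=50):
    """Invert depth_from_pressure by repeated correction of a guess."""
    p = z_m                      # first guess: about 1 decibar per metre
    err = z_m - depth_from_pressure(p)
    seen = set()                 # earlier guesses, to spot a limit cycle
    for _ in range(mloop):       # never loop more than mloop times
        if err == 0.0 or p in seen:
            break
        seen.add(p)
        p = p + err              # nudge the guess by the current error
        err = z_m - depth_from_pressure(p)
    if abs(err) > eps:           # still too far off: give up
        raise RuntimeError("ERR IN FNZTOP - ITERATION HASN'T CONVERGED")
    return p


print(pressure_from_depth(1000.0))
```

With sensible input the loop settles in a handful of passes. If the input is garbage, or the iteration limit is too small (try `mloop=1`), the error never gets below eps and the routine bails out, which is the kind of exit reported in the stderr.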

So there are calculation errors in two models. The question is:

1. is this the fault of the model type or this particular model/workunit?

2. or is it the fault of your computer?

You'll only know for sure that the models in these workunits are OK if other computers in the same WUs complete them. Nobody has yet.

You would know for sure that these workunits are defective if other computers in the same WUs crash them with the same or similar calculation errors. I don't think that's happened yet.

You need to think about possibility #2 because this appears to be the computer that was (still is?) overclocked and crashed a couple of earlier models with 'Maximum elapsed time exceeded'.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41344 - Posted: 23 Dec 2010, 0:05:42 UTC

The name of this model type is 'HadCM3 Coupled Model Experiment Optimised File I/O v6.04'. It's the same basic model type as the BBC model. Too many of them crashed with calculation errors. Tolu optimised the model type by slowing them down by about 20%. The slower models hardly ever produced calculation errors.

I'm not saying that processing speed is necessarily the significant factor in the case of your crashed models, but it could be.

Mind, I don't think I've ever seen these particular calculation error messages before.

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41346 - Posted: 23 Dec 2010, 18:06:51 UTC
Last modified: 23 Dec 2010, 18:07:34 UTC

For my Task 12009783, from Workunit 6963025, there is a paired fast Intel Xeon 5160 @ 3.00GHz on Linux that is not far behind in the calculation.

I guess I'll watch and see if it has any problems. On the other hand I do think it might be just a problem with my overclocked computer. I did dial down the overclock a bit and the PC ran fine for over 25 days with no problem before.

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41424 - Posted: 3 Jan 2011, 22:59:55 UTC
Last modified: 3 Jan 2011, 23:00:19 UTC

So I was finally able to check my computer.

These are the boinc messages:



21-Dec-2010 12:46:50 [climateprediction.net] Requesting new tasks for GPU
21-Dec-2010 12:46:53 [climateprediction.net] Scheduler request completed: got 0 new tasks
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work sent
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work is available for UK Met Office HadSM3 Slab Model
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O
21-Dec-2010 12:46:53 [climateprediction.net] Message from server: No work available for the applications you have selected. Please check your settings on the web site.
21-Dec-2010 12:46:54 [climateprediction.net] Started upload of hadcm3igeo_w2kx_2000_80_06759712_0_6.zip
21-Dec-2010 12:49:43 [climateprediction.net] Finished upload of hadcm3igeo_w2kx_2000_80_06759712_0_6.zip
21-Dec-2010 13:29:59 [climateprediction.net] Computation for task hadcm3igeo_w2l0_2000_80_06759709_1 finished
21-Dec-2010 13:29:59 [climateprediction.net] Output file hadcm3igeo_w2l0_2000_80_06759709_1_7.zip for task hadcm3igeo_w2l0_2000_80_06759709_1 absent
21-Dec-2010 13:29:59 [climateprediction.net] Output file hadcm3igeo_w2l0_2000_80_06759709_1_8.zip for task hadcm3igeo_w2l0_2000_80_06759709_1 absent
21-Dec-2010 13:30:04 [climateprediction.net] Computation for task hadcm3igeo_w2kx_2000_80_06759712_0 finished
21-Dec-2010 13:30:04 [climateprediction.net] Output file hadcm3igeo_w2kx_2000_80_06759712_0_7.zip for task hadcm3igeo_w2kx_2000_80_06759712_0 absent
21-Dec-2010 13:30:04 [climateprediction.net] Output file hadcm3igeo_w2kx_2000_80_06759712_0_8.zip for task hadcm3igeo_w2kx_2000_80_06759712_0 absent
21-Dec-2010 13:30:19 [climateprediction.net] Computation for task hadcm3igeo_w2yc_2000_80_06759229_4 finished
21-Dec-2010 13:30:19 [climateprediction.net] Output file hadcm3igeo_w2yc_2000_80_06759229_4_6.zip for task hadcm3igeo_w2yc_2000_80_06759229_4 absent
21-Dec-2010 13:30:19 [climateprediction.net] Output file hadcm3igeo_w2yc_2000_80_06759229_4_7.zip for task hadcm3igeo_w2yc_2000_80_06759229_4 absent
21-Dec-2010 13:30:19 [climateprediction.net] Output file hadcm3igeo_w2yc_2000_80_06759229_4_8.zip for task hadcm3igeo_w2yc_2000_80_06759229_4 absent
21-Dec-2010 13:31:21 [climateprediction.net] Sending scheduler request: To fetch work.
21-Dec-2010 13:31:21 [climateprediction.net] Reporting 3 completed tasks, requesting new tasks for CPU
21-Dec-2010 13:31:29 [climateprediction.net] Scheduler request completed: got 3 new tasks
21-Dec-2010 13:31:31 [climateprediction.net] Started download of hadam3p_pnw_6.08_windows_intelx86.exe
21-Dec-2010 13:31:31 [climateprediction.net] Started download of hadam3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:31:37 [climateprediction.net] Finished download of hadam3p_pnw_6.08_windows_intelx86.exe
21-Dec-2010 13:31:37 [climateprediction.net] Started download of hadam3p_pnw_graphics_6.08_windows_intelx86.exe
21-Dec-2010 13:31:41 [climateprediction.net] Finished download of hadam3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:31:41 [climateprediction.net] Started download of hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:47 [climateprediction.net] Finished download of hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:47 [climateprediction.net] Started download of hadrm3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:31:48 [climateprediction.net] [error] File hadam3p_pnw_se_6.08_windows_intelx86.zip has wrong size: expected 987053, got 0
21-Dec-2010 13:31:48 [climateprediction.net] [error] Checksum or signature error for hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:52 [climateprediction.net] Finished download of hadam3p_pnw_graphics_6.08_windows_intelx86.exe
21-Dec-2010 13:31:52 [climateprediction.net] Started download of hadam3p_pnw_data_6.08_windows_intelx86.zip
21-Dec-2010 13:31:53 [climateprediction.net] [error] File hadam3p_pnw_graphics_6.08_windows_intelx86.exe has wrong size: expected 2098176, got 0
21-Dec-2010 13:31:53 [climateprediction.net] [error] Checksum or signature error for hadam3p_pnw_graphics_6.08_windows_intelx86.exe
21-Dec-2010 13:31:54 [climateprediction.net] Finished download of hadam3p_pnw_data_6.08_windows_intelx86.zip
21-Dec-2010 13:31:54 [climateprediction.net] Started download of hadam3p_eu_xyo1_1960_1_007050137.zip
21-Dec-2010 13:31:55 [climateprediction.net] [error] File hadam3p_pnw_data_6.08_windows_intelx86.zip has wrong size: expected 75116, got 0
21-Dec-2010 13:31:55 [climateprediction.net] [error] Checksum or signature error for hadam3p_pnw_data_6.08_windows_intelx86.zip
21-Dec-2010 13:31:57 [climateprediction.net] Finished download of hadam3p_eu_xyo1_1960_1_007050137.zip
21-Dec-2010 13:31:57 [climateprediction.net] Started download of o3_A2_1959_2010_N96_f.anc.gz
21-Dec-2010 13:31:59 [climateprediction.net] [error] File hadam3p_eu_xyo1_1960_1_007050137.zip has wrong size: expected 12637, got 0
21-Dec-2010 13:31:59 [climateprediction.net] [error] Checksum or signature error for hadam3p_eu_xyo1_1960_1_007050137.zip
21-Dec-2010 13:32:00 [climateprediction.net] Finished download of hadrm3p_pnw_um_6.08_windows_intelx86.zip
21-Dec-2010 13:32:00 [climateprediction.net] Started download of ic19611020_10_N96.gz
21-Dec-2010 13:32:10 [climateprediction.net] Finished download of ic19611020_10_N96.gz
21-Dec-2010 13:32:10 [climateprediction.net] Started download of xaclfa.start.0000.gz
21-Dec-2010 13:32:10 [climateprediction.net] [error] File ic19611020_10_N96.gz has wrong size: expected 1314394, got 0
21-Dec-2010 13:32:10 [climateprediction.net] [error] Checksum or signature error for ic19611020_10_N96.gz
21-Dec-2010 13:32:25 [climateprediction.net] Finished download of o3_A2_1959_2010_N96_f.anc.gz
21-Dec-2010 13:32:25 [climateprediction.net] Started download of oxi.addfa.gz
21-Dec-2010 13:33:04 [climateprediction.net] Sending scheduler request: To report completed tasks.
21-Dec-2010 13:33:04 [climateprediction.net] Reporting 2 completed tasks, not requesting new tasks
21-Dec-2010 13:33:07 [climateprediction.net] Scheduler request completed
21-Dec-2010 13:33:09 [climateprediction.net] [error] File hadam3p_eu_6.08_windows_intelx86.exe has wrong size: expected 780288, got 0
21-Dec-2010 13:33:12 [climateprediction.net] Sending scheduler request: To fetch work.
21-Dec-2010 13:33:12 [climateprediction.net] Not reporting or requesting tasks
21-Dec-2010 13:33:13 [climateprediction.net] Scheduler request completed
21-Dec-2010 13:33:19 [climateprediction.net] Sending scheduler request: To fetch work.
21-Dec-2010 13:33:19 [climateprediction.net] Reporting 1 completed tasks, requesting new tasks for CPU
21-Dec-2010 13:33:21 [climateprediction.net] Scheduler request completed: got 0 new tasks
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work sent
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for UK Met Office HadSM3 Slab Model
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for HadCM3 Coupled Model Experiment Optimised File I/O
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for UK Met Office HADAM3P European Region
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: No work is available for UK Met Office HADAM3P Pacific North West
21-Dec-2010 13:33:21 [climateprediction.net] Message from server: (reached daily quota of 3 tasks)
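
For anyone wading through a long message log like the one above, the failed downloads can be picked out with a short script. This is a minimal Python sketch that assumes the log lines have the same wording as the ones pasted here; the function name and the two sample lines (copied from the log) are only for illustration.

```python
import re

# Matches lines such as:
# "... [error] File NAME has wrong size: expected 987053, got 0"
WRONG_SIZE = re.compile(
    r"\[error\] File (?P<name>\S+) has wrong size: "
    r"expected (?P<expected>\d+), got (?P<got>\d+)"
)


def failed_downloads(log_text):
    """Return (file name, expected bytes, received bytes) per failed file."""
    failures = []
    for line in log_text.splitlines():
        match = WRONG_SIZE.search(line)
        if match:
            failures.append(
                (match["name"], int(match["expected"]), int(match["got"]))
            )
    return failures


SAMPLE_LOG = """\
21-Dec-2010 13:31:47 [climateprediction.net] Finished download of hadam3p_pnw_se_6.08_windows_intelx86.zip
21-Dec-2010 13:31:48 [climateprediction.net] [error] File hadam3p_pnw_se_6.08_windows_intelx86.zip has wrong size: expected 987053, got 0
"""

for name, expected, got in failed_downloads(SAMPLE_LOG):
    print(f"{name}: expected {expected} bytes, got {got}")
```

In the log above every such line reports 0 bytes received straight after a "Finished download" message for the same file.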

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2184
Credit: 64,822,615
RAC: 5,275
Message 41425 - Posted: 4 Jan 2011, 0:14:04 UTC

Two of the hadcm3's crashed after the 60th year/trickle and the other after the 57th, all at the same time as you said. Since then, that PC has not been able to correctly download any models.

My guess is a hardware problem. It could be a hard disk issue, a memory issue, or a processor issue. I would test the system with Prime95 for a while, along with any hardware diagnostic software you can (memtest86+, your hard drive manufacturer's diagnostic tests). Until those tests come back clean, don't try to download any more BOINC tasks.

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41426 - Posted: 4 Jan 2011, 1:27:04 UTC
Last modified: 4 Jan 2011, 1:27:16 UTC

That's very interesting because when I accessed the computer it seemed that it was running fine. The up-time was 18 days (I rebooted before I left on winter vacation). Boinc was running and was responsive. It was interesting that it did not contact the server after the 22nd. In the message log it said that it was just running CPU benchmarks every once in a while.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2184
Credit: 64,822,615
RAC: 5,275
Message 41427 - Posted: 4 Jan 2011, 3:17:26 UTC

If it's just the software, you could try to do a reset on the climateprediction.net project. That should flush any troublesome files and download a new batch. Of course if it's a hardware error, eventually something similar will happen again.

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41428 - Posted: 4 Jan 2011, 6:41:17 UTC - in response to Message 41427.  

I am overclocking the computer, but the problem is that if this is a hardware error related to the overclock, it manifests itself very rarely. The models made it to year 60, which is a lot of calculation to go without error. I imagine this will be very difficult, if not impossible, for me to track down.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2184
Credit: 64,822,615
RAC: 5,275
Message 41430 - Posted: 4 Jan 2011, 13:58:18 UTC - in response to Message 41428.  

Right now, as it's not downloading any new work, it seems that something has become corrupted in your boincdata directory. This is probably because of a hardware error, and quite likely a result of overclocking. If you can't run Prime95 for several hours without an error, cpdn work will crash; it is that simple. Prime95 is more taxing on the system than cpdn and should show an error if there is a CPU or memory problem. If it's hard disk corruption, for whatever reason, Prime95 likely wouldn't show that, but disk corruption could also explain the errors listed on your task result webpages.

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41431 - Posted: 4 Jan 2011, 16:47:59 UTC - in response to Message 41430.  

I ran checkdisk on the hard drive and it found no errors. (It's a 2 TB RAID 0 array.)

I am running Prime95 in 'blend' mode right now. I will run it for 48 hours.
