Bug narrowed down : [climateprediction.net] Computation for result xxxx finished

old_user169

Joined: 5 Aug 04
Posts: 39
Credit: 87,633
RAC: 0
Message 5894 - Posted: 3 Nov 2004, 19:26:23 UTC
Last modified: 14 Nov 2004, 16:13:13 UTC

After I stopped BOINC for a moment and then restarted it, it destroyed 2 results:

stdout.txt =

2004-11-03 20:40:28 [---] Starting BOINC client version 4.13 for windows_intelx86
2004-11-03 20:40:28 [climateprediction.net] Project prefs: using separate prefs for school
2004-11-03 20:40:28 [LHC@home] Project prefs: using separate prefs for school
2004-11-03 20:40:28 [climateprediction.net] Host ID is 46600
2004-11-03 20:40:28 [LHC@home] Host ID is 17130
2004-11-03 20:40:28 [---] General prefs: from LHC@home (last modified 2004-11-02 22:50:22)
2004-11-03 20:40:28 [---] General prefs: using separate prefs for school
2004-11-03 20:40:28 [climateprediction.net] Resuming computation for result 2z86_000160367_1 using hadsm3 version 4.03
2004-11-03 20:40:28 [climateprediction.net] Resuming computation for result 30jy_000162104_1 using hadsm3 version 4.03
2004-11-03 20:40:29 [LHC@home] Started upload of v64lhc1000protwelve-58s8_1053.42_1_sixvf_39751_0_0
2004-11-03 20:40:29 [LHC@home] Started upload of v64lhc1000protwelve-59s10_12553.46_1_sixvf_42652_1_0
2004-11-03 20:40:29 [climateprediction.net] Computation for result 2z86_000160367 finished
2004-11-03 20:40:29 [climateprediction.net] Starting result 3siy_000198715_0 using hadsm3 version 4.04
2004-11-03 20:40:29 [climateprediction.net] Computation for result 30jy_000162104 finished

stderr.txt =

2004-11-03 20:40:29 [climateprediction.net] Unrecoverable error for result 2z86_000160367_1 ( - exit code -1 (0xffffffff))
2004-11-03 20:40:29 [climateprediction.net] Unrecoverable error for result 30jy_000162104_1 ( - exit code -1 (0xffffffff))
2004-11-03 20:40:29 [climateprediction.net] Deferring communication with project for 1 minutes and 0 seconds
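
For anyone who wants to check their own stderr.txt for the same failure, here is a rough sketch that pulls out the "Unrecoverable error" lines. It assumes the line layout shown above (timestamp, project in brackets, result name, exit code); the function name `find_failed_results` is made up for this example and is not part of BOINC.

```python
import re

# Pattern for lines like:
# 2004-11-03 20:40:29 [climateprediction.net] Unrecoverable error for result
#   2z86_000160367_1 ( - exit code -1 (0xffffffff))
# (assumption: other BOINC versions may format this message differently)
ERROR_RE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) "
)


def find_failed_results(lines):
    """Return (time, project, result, exit_code) for each unrecoverable error."""
    failures = []
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if match:
            failures.append(
                (match["time"], match["project"], match["result"], int(match["code"]))
            )
    return failures
```

Feeding it the three stderr.txt lines above would report the two destroyed results and skip the "Deferring communication" line.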

It happened while LHC was completely unreachable. I'm not sure whether the problem is related to that, but the chance is quite high, as BOINC immediately tried to upload a bunch of LHC results, which of course failed.

After stopping BOINC again, I saw that hadsm3um_4.03_windows_intelx86.exe was still running although all other BOINC processes had exited. This might of course also be a reason for the failure.

BOINC 4.13 / Win2k SP4 / Dual Athlon MP 2600+


I have saved all XML and project files and will now report the two damaged WUs:

resultid=268718
resultid=261770


If you need any of the files to help figure out the problem, I can upload them to some web space.

90 trickles lost in BOINC space :´(
old_user169

Joined: 5 Aug 04
Posts: 39
Credit: 87,633
RAC: 0
Message 5895 - Posted: 3 Nov 2004, 19:44:50 UTC

I found something else that looks suspicious, in stderr_um.txt:

...
OPEN: File dataout/2z86ca.dap3bj0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3bm0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3bp0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3bs0 Created on Unit 22
OPEN: File dataout/2z86ca.dap3c10 Created on Unit 22
CLOSE: WARNING: Unit 60 Not Opened
OPEN: File dataout/2z86ca.pap4c10 Created on Unit 60
CLOSE: WARNING: Unit 63 Not Opened
OPEN: File dataout/2z86ca.pdp4c10 Created on Unit 63
CLOSE: WARNING: Unit 64 Not Opened
OPEN: File dataout/2z86ca.pep4c10 Created on Unit 64
CLOSE: WARNING: Unit 65 Not Opened
OPEN: File dataout/2z86ca.pfp4c10 Created on Unit 65
CLOSE: WARNING: Unit 66 Not Opened
OPEN: File dataout/2z86ca.pgp4c10 Created on Unit 66
CLOSE: WARNING: Unit 67 Not Opened
OPEN: File dataout/2z86ca.php4c10 Created on Unit 67
OPEN: File dataout/2z86ca.dap3c40 Created on Unit 22
OPEN: File dataout/2z86ca.dap3c70 Created on Unit 22
OPEN: File dataout/2z86ca.dap3ca0 Created on Unit 22
...

The other model that was destroyed did not have any errors in this file.
old_user169

Joined: 5 Aug 04
Posts: 39
Credit: 87,633
RAC: 0
Message 6100 - Posted: 14 Nov 2004, 10:27:12 UTC
Last modified: 14 Nov 2004, 10:39:43 UTC

It (nearly) happened again, this time with the CLI:

I shut down the BOINC CLI but the CPDN client kept running.

This time I noticed that it was still there, so I killed CPDN from the Task Manager. The model was not destroyed this time; it restarted properly.

So now I'm quite sure the problem is that under certain circumstances the project client doesn't exit, and the BOINC client doesn't retry killing it.
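
To make the "retry the kill" idea concrete, here is a minimal sketch of how a parent process could ask a child to exit, wait, and escalate until the child is confirmed gone. This is only an illustration of the technique, not how the BOINC client is actually written; `stop_child`, `grace_seconds` and `attempts` are hypothetical names chosen for the example.

```python
import subprocess
import sys


def stop_child(proc, grace_seconds=5.0, attempts=3):
    """Stop a child process, retrying until it is confirmed to have exited.

    Returns True once the child is gone, False if it survived every attempt.
    """
    if proc.poll() is not None:
        return True  # child already exited on its own
    for attempt in range(attempts):
        if attempt == 0:
            proc.terminate()  # first attempt: the normal stop request
        else:
            proc.kill()  # later attempts: force the process to end
        try:
            proc.wait(timeout=grace_seconds)
            return True
        except subprocess.TimeoutExpired:
            continue  # still running, so try again instead of giving up
    return False


if __name__ == "__main__":
    # Stand-in for a long-running science application.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("child stopped:", stop_child(child))
```

The point of the loop is that the parent never assumes the first stop request worked. It only reports success after the wait confirms the child has exited, which is the check that seems to be missing in the behaviour described above.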


The machine was under heavy load when it happened (one CPU was doing Seti Classic, the other was supposed to do some 3D rendering). I have now reproduced it 3 times under the same load, so it is definitely a bug.


(I posted a link to this thread on the BOINC forum.)
old_user169

Joined: 5 Aug 04
Posts: 39
Credit: 87,633
RAC: 0
Message 6116 - Posted: 15 Nov 2004, 6:20:10 UTC

One more thought about the problem :

If it is hadsm3<b>se</b>...exe rather than hadsm3<b>um</b>...exe that calls <i>boinc_init_options()</i> with <i>opt.main_program=true</i>, this would explain why slots/?/boinc_lockfile failed to prevent a second client from working on the same model.
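
For readers unfamiliar with lock files, here is a simplified sketch of the general idea: whoever creates the file first owns the slot, and a second process that finds it already there must back off. This is an assumption-laden illustration only. The real BOINC API may hold an OS-level lock on the file rather than relying on its existence, and `acquire_slot_lock` and `release_slot_lock` are hypothetical names, not BOINC functions.

```python
import os
import tempfile


def acquire_slot_lock(slot_dir, name="boinc_lockfile"):
    """Try to create the lock file exclusively.

    Returns the lock path on success, or None if another process holds it.
    """
    path = os.path.join(slot_dir, name)
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return path


def release_slot_lock(path):
    """Give up the slot by deleting the lock file."""
    os.remove(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        first = acquire_slot_lock(slot)
        second = acquire_slot_lock(slot)  # simulates a second client
        print("first got lock:", first is not None)
        print("second got lock:", second is not None)
        release_slot_lock(first)
```

A lock like this only protects the model if the process that is left behind is the one holding it. If a different executable owns the lock and exits normally, the orphaned one keeps writing to the slot with nothing to stop a new client from starting alongside it.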