climateprediction.net home page
Posts by old_user691968

1) Message boards : Number crunching : What went wrong (crashed WU) (Message 45473)
Posted 18 Jan 2013 by old_user691968
Post:
A question about backups: is it enough to copy the \ProgramData\BOINC\projects\climateprediction.net project folder, or is it necessary to copy the whole ProgramData directory as the tutorial says? If I did the latter, wouldn't restoring it turn back the clock on every task I've run (including those from other projects) to the time of the backup?
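
For reference, a whole-directory backup of the kind the tutorial describes could be scripted roughly as below. This is only a sketch under assumptions: the function name `backup_boinc_data` and the example paths are made up for illustration, and it does not settle whether the project folder alone would be enough. It is generally safest to exit BOINC before copying so that no files change mid-copy.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole BOINC data directory into a new timestamped folder.

    Hypothetical helper: the name and folder layout are illustrative only.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {data_dir}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"BOINC-backup-{stamp}"
    # copytree creates the target folder and raises an error if it already
    # exists, so an earlier backup is never silently overwritten.
    shutil.copytree(data_dir, target)
    return target


# Example call (assumed locations, adjust to your own installation):
# backup_boinc_data(Path(r"C:\ProgramData\BOINC"), Path(r"D:\Backups"))
```

Because each backup lands in its own timestamped folder, several can be kept side by side and the most suitable one picked when restoring.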
2) Message boards : Number crunching : What went wrong (crashed WU) (Message 45471)
Posted 18 Jan 2013 by old_user691968
Post:
As for Joe's problem, there's an awful lot of BOINC suspensions.

I get the feeling that the setting for "Suspend work if CPU usage is above" is still at the default of 25%, which means that BOINC, and the science apps, are constantly being stopped and started as Joe uses the computer.

Other projects' work may not mind, but the Coupled Ocean models are too touchy for this. Sooner or later they usually fail, especially if "Leave tasks in memory while suspended?" isn't set to Yes.




I'd set the suspend work threshold to 0 (i.e. no threshold) after noticing once that BOINC seesawed between running and not running every ten seconds or so while I was doing nothing at the computer. It didn't seem to affect my computer usage, but someone at the BOINC forums claimed that this may be one of my problems.
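
The two settings discussed here can also be written down as a local preferences override. The sketch below only builds the XML text and touches no files. The tag names `suspend_cpu_usage` and `leave_apps_in_memory`, and the idea of placing the result in a `global_prefs_override.xml` file in the BOINC data directory, are my understanding of BOINC's override format and should be checked against the BOINC documentation; the function name `build_prefs_override` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(suspend_cpu_usage: float, leave_apps_in_memory: bool) -> str:
    """Return XML text for a local preferences override (assumed tag names).

    suspend_cpu_usage: percentage threshold; 0 is treated in this thread
    as "no threshold".
    leave_apps_in_memory: True corresponds to answering Yes to
    "Leave tasks in memory while suspended?".
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.1f}"
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_apps_in_memory else "0"
    return ET.tostring(root, encoding="unicode")


# The combination described above: no CPU-usage threshold, tasks kept in memory.
print(build_prefs_override(0, True))
```

The same values can of course be set through the BOINC Manager's preferences dialog instead, which avoids editing files by hand.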

I've posted a lot about my efforts to eradicate the "Task exited with zero status but no 'finished' file" errors that I got in BOINC's log file corresponding with the time this CPDN model seemed to give up the ghost (11:36:20?), here: boinc.berkeley.edu/dev/forum_thread.php?id=8134&postid=47366

I'd be grateful if an expert from here took a look at that thread to see if there's anything I've missed.

But I'd love to know if these errors are even the cause of the failure--or was it this line?
"Atmos Hold Restart file rename failed on atmos_restart.hold"

And was what I listed the complete stderr log or does it seem to be cut off in the middle?
3) Message boards : Number crunching : What went wrong (crashed WU) (Message 45457)
Posted 15 Jan 2013 by old_user691968
Post:
I've just excluded BOINC and ProgramData from my scanner. Looking at the log, it appears the error occurred when BOINC tried to suspend the task while I was away from my computer. There doesn't seem to be any reason for it to do so except for the scheduled project switching, so I've set the project switching interval to 99999 minutes (about 1666 hours), hopefully long enough for one project to finish running in one go, barring any computer downtime. Anything else to look for? Thanks!
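
As a quick check of that figure, the conversion can be worked through as follows. This is plain arithmetic on the 99999-minute value and makes no claim about how BOINC itself treats the interval; the helper name `minutes_to_hours_and_days` is made up for illustration.

```python
def minutes_to_hours_and_days(minutes: int) -> tuple[float, float]:
    """Convert a duration in minutes to (hours, days)."""
    hours = minutes / 60
    days = minutes / (60 * 24)
    return hours, days


hours, days = minutes_to_hours_and_days(99999)
print(f"{hours:.2f} hours, {days:.2f} days")  # 1666.65 hours, 69.44 days
```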
4) Message boards : Number crunching : What went wrong (crashed WU) (Message 45455)
Posted 15 Jan 2013 by old_user691968
Post:
stderr of Task 15527415

<core_client_version>7.0.42</core_client_version>
<![CDATA[
<message>
- exit code 193 (0xc1)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
00:11:14 (4080): Can't acquire lockfile (32) - waiting 35s
00:11:19 (7496): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
06:38:20 (8224): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
06:38:21 (8224): No heartbeat from core client for 30 sec - exiting
06:38:22 (8224): No heartbeat from core client for 30 sec - exiting
06:38:23 (8224): No heartbeat from core client for 30 sec - exiting
06:38:24 (8224): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - Suspend request from BOINC...
09:54:20 (8800): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
CPDN Monitor - Quit request from BOINC...
16:35:15 (8244): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
17:43:20 (860): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
21:39:09 (8268): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
CPDN Monitor - Quit request from BOINC...
00:28:29 (8804): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
00:28:39 (8804): No heartbeat from core client for 30 sec - exiting
00:28:40 (8804): No heartbeat from core client for 30 sec - exiting
00:28:42 (8804): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - Suspend request from BOINC...
03:56:02 (8824): No heartbeat from core client for 30 sec - exiting
Suspended CPDN Monitor - No 'heartbeat' from BOINC...
11:36:16 (4224): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
11:36:20 (4224): No heartbeat from core client for 30 sec - exiting
Atmos Hold Restart file rename failed on atmos_restart.hold
Suspended CPDN Monitor - Suspend request from BOINC...

</stderr_txt>
]]>
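
A log like the one above is easier to read once the repeated lines are tallied. The sketch below counts a few message types by simple substring matching. The category names and the `summarise_stderr` function are made up for illustration, and the patterns are taken only from the lines visible in this dump, so other CPDN or BOINC messages would go uncounted.

```python
import re
from collections import Counter

# Patterns based on lines seen in this particular stderr dump (assumed to be
# representative). The "CPDN Monitor - No 'heartbeat'" echo lines are
# deliberately left unmatched so each heartbeat loss is not counted twice.
PATTERNS = {
    "suspend_request": re.compile(r"Suspend request from BOINC"),
    "quit_request": re.compile(r"Quit request from BOINC"),
    "no_heartbeat": re.compile(r"No heartbeat from core client"),
    "lockfile_wait": re.compile(r"Can't acquire lockfile"),
    "restart_rename_failed": re.compile(r"rename failed"),
}


def summarise_stderr(text: str) -> Counter:
    """Count how many lines fall into each known message category."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # a line belongs to at most one category
    return counts


# Usage: paste the contents of <stderr_txt> into a string and call
# summarise_stderr(that_string) to get the counts per category.
```

A tally like this makes it easier to compare tasks, for example to see whether a failed task had noticeably more suspend or heartbeat events than one that ran longer.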

I haven't been able to run a single model to completion, and I've run 4 or 5 WUs by now...

Is a WU totally useless if it isn't completed or can the trickles be used to build a new WU where the old one left off?



