What went wrong (crashed WU)

old_user691968 Send message Joined: 30 Dec 12 Posts: 4 Credit: 7,776 RAC: 0	Message 45455 - Posted: 15 Jan 2013, 11:30:46 UTC stderr of Task 15527415 <core_client_version>7.0.42</core_client_version> <![CDATA[ <message> - exit code 193 (0xc1) </message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... 00:11:14 (4080): Can't acquire lockfile (32) - waiting 35s 00:11:19 (7496): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... 
CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... 06:38:20 (8224): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... 06:38:21 (8224): No heartbeat from core client for 30 sec - exiting 06:38:22 (8224): No heartbeat from core client for 30 sec - exiting 06:38:23 (8224): No heartbeat from core client for 30 sec - exiting 06:38:24 (8224): No heartbeat from core client for 30 sec - exiting Suspended CPDN Monitor - Suspend request from BOINC... 09:54:20 (8800): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... CPDN Monitor - Quit request from BOINC... 16:35:15 (8244): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... 17:43:20 (860): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... 21:39:09 (8268): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... CPDN Monitor - Quit request from BOINC... 00:28:29 (8804): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... 00:28:39 (8804): No heartbeat from core client for 30 sec - exiting 00:28:40 (8804): No heartbeat from core client for 30 sec - exiting 00:28:42 (8804): No heartbeat from core client for 30 sec - exiting Suspended CPDN Monitor - Suspend request from BOINC... 03:56:02 (8824): No heartbeat from core client for 30 sec - exiting Suspended CPDN Monitor - No 'heartbeat' from BOINC... 11:36:16 (4224): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... 11:36:20 (4224): No heartbeat from core client for 30 sec - exiting Atmos Hold Restart file rename failed on atmos_restart.hold Suspended CPDN Monitor - Suspend request from BOINC... 
</stderr_txt> ]]> I haven't been able to run a single model to completion, and I've run 4 or 5 WUs by now... Is a WU totally useless if it isn't completed or can the trickles be used to build a new WU where the old one left off? ID: 45455 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,546,891 RAC: 4,053	Message 45456 - Posted: 15 Jan 2013, 12:31:54 UTC - in response to Message 45455. Some models crash because an impossible climate is generated, e.g. negative pressure. However, the fact that all your models are crashing points to something else. It is worth making sure that your BOINC folder is excluded from any antivirus program as if BOINC tries to write to a file while the antivirus has an exclusive lock on it the task will crash. The current models available have a habit of crashing at the 25, 50, 75 and 100% points, particularly if the computer is shut down and restarted around these points. When available, the regional models are much less prone to this. Before long one of the moderators will be along to fill in the bits I have missed out, of which there are quite a few. ID: 45456 · Reply Quote

old_user691968 Send message Joined: 30 Dec 12 Posts: 4 Credit: 7,776 RAC: 0	Message 45457 - Posted: 15 Jan 2013, 13:02:11 UTC I've just excluded BOINC and ProgramData from my scanner. Looking at the log, it appears the error occurred when BOINC tried to suspend the task while I was away from my computer. There doesn't seem to be any reason for it to do so except for the scheduled project switching, so I've set the project switching interval to 99999 minutes (about 1,666 hours), hopefully long enough for one project to finish running in one go, barring any computer downtime. Anything else to look for? Thanks! ID: 45457 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,546,891 RAC: 4,053	Message 45458 - Posted: 15 Jan 2013, 13:37:20 UTC - in response to Message 45457. Don't know about project switching, as when work is available I only run CPDN. The other thing I just remembered is that running another application that is very heavy on CPU time, e.g. video rendering, can cause a model to crash, so it is worth suspending tasks before doing that. It is also worth checking the times the other models crashed to see if they all do it while you are away. The last one to crash looks as if it may have been at the 25% point, where models are more vulnerable to crashing. ID: 45458 · Reply Quote
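The check suggested above (when did each interruption happen?) can be sketched in a few lines of Python. This is only an illustration, assuming the stderr text has been pasted into a string; the function name `summarize_stderr` is made up for this example, and the patterns are taken from the log lines quoted in the first post. Note that the stderr timestamps carry no date, so they show the time of day only.

```python
import re
from collections import Counter

# Matches lines such as:
#   "06:38:20 (8224): No heartbeat from core client for 30 sec - exiting"
# capturing the time of day and the process id.
HEARTBEAT_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}) \((\d+)\): No heartbeat from core client"
)

def summarize_stderr(text):
    """Count suspend/quit/heartbeat events and list heartbeat-loss times."""
    counts = Counter()
    counts["suspend"] = text.count("Suspend request from BOINC")
    counts["quit"] = text.count("Quit request from BOINC")
    heartbeats = HEARTBEAT_RE.findall(text)  # list of (time, pid) tuples
    counts["no_heartbeat"] = len(heartbeats)
    return counts, heartbeats

# A short excerpt in the same shape as the log in the first post.
sample = (
    "Suspended CPDN Monitor - Suspend request from BOINC... "
    "CPDN Monitor - Quit request from BOINC... "
    "06:38:20 (8224): No heartbeat from core client for 30 sec - exiting "
    "06:38:21 (8224): No heartbeat from core client for 30 sec - exiting "
)
counts, heartbeats = summarize_stderr(sample)
print(counts)
print(heartbeats)
```

Comparing the listed times with the hours the computer was in use or left unattended gives a rough idea of whether the interruptions cluster around one kind of activity.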

Byron Leigh Hatch @ team Carl ... Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 45459 - Posted: 15 Jan 2013, 20:56:33 UTC - in response to Message 45456. Last modified: 15 Jan 2013, 21:16:22 UTC - Dave Jackson wrote: <quote> It is worth making sure that your BOINC folder is excluded from any antivirus program as if BOINC tries to write to a file while the antivirus has an exclusive lock on it the task will crash. </quote> Hi Dave, I don't know how to do this. Could you - or anyone - kindly give me some instructions on how I would do this? I'm running Windows 7 Ultimate x86 Edition, Service Pack 1, (06.01.7601.00) BOINC 7.0.28 (x86) - running as a single installation - (not as a service) I'm using McAfee Anti Virus on this Computer - My Computer # 1167855 - my fastest Computer - I only run CPDN thanks in advance Byron ID: 45459 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45460 - Posted: 16 Jan 2013, 0:10:11 UTC - in response to Message 45459. Somewhere in the menu of your AV, there should be a place where you can specify exceptions. It may be in an Options section, and the words used will most likely vary between AVs. I've got separate logical drives for both parts of BOINC, so I just need to specify a drive letter, but others will need to have a longer string to define the locations. There may be a Browse option, which will allow you to hunt for the locations, and then click to specify them. There may be 2 parts to this: 1) A regular, automated scan. 2) A manual scan. Both need to be set if they're separate. Backups: Here ID: 45460 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,546,891 RAC: 4,053	Message 45461 - Posted: 16 Jan 2013, 9:07:04 UTC All I know about sorting out windows problems with BOINC and CPDN is from reading here. Last time my own machines had window$ on them was 13 years ago. I have been all Linux since then so can't tell you any more than Les has. ID: 45461 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45462 - Posted: 16 Jan 2013, 21:39:38 UTC As for Joe's problem, there's an awful lot of BOINC suspensions. I get the feeling that the setting for "Suspend work if CPU usage is above" is still at the default of 25%, which means that BOINC, and the science apps, are constantly being stopped and started as Joe uses the computer. Other projects' work may not mind, but the Coupled Ocean models are too touchy for this. Sooner or later they usually fail. Especially if "Leave tasks in memory while suspended?" isn't set to Yes. Backups: Here ID: 45462 · Reply Quote
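The two settings mentioned above are normally changed in the BOINC Manager's computing preferences. As a rough sketch only, the same values can be written as a local override file. The file name `global_prefs_override.xml` and the element names `suspend_cpu_usage` and `leave_apps_in_memory` are stated here as assumptions based on the BOINC client's preference override format, and should be checked against the documentation for the installed version. The helper `build_prefs_override` is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_prefs_override(suspend_cpu_usage=0, leave_in_memory=True):
    """Build the text of a preferences override (element names are assumed).

    suspend_cpu_usage: CPU-usage percentage above which work is suspended;
                       0 is taken to mean "no threshold".
    leave_in_memory:   keep suspended tasks in memory rather than unloading.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "suspend_cpu_usage").text = str(suspend_cpu_usage)
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_in_memory else "0"
    return ET.tostring(root, encoding="unicode")

override_xml = build_prefs_override()
print(override_xml)
```

The sketch only produces the XML text; where the file has to be placed, and how the client is told to re-read it, depends on the installation and is not shown.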

Chris Send message Joined: 9 Apr 12 Posts: 10 Credit: 2,700,404 RAC: 0	Message 45463 - Posted: 16 Jan 2013, 23:00:26 UTC Any idea why these are so much worse than the regional models? They take 3 weeks to run, so there is a lot of time for things to happen, but is there no way to make sure they suspend uneventfully? Even the shorter running models take a lot of faith to run, and the big ones, where it's possible, and even likely, that I'll lose a model after 20 days, aren't so good. ID: 45463 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45464 - Posted: 17 Jan 2013, 1:19:13 UTC - in response to Message 45463. To start with, the models are from among those created by and for the UK Met Office, where they run on supercomputers. i.e. No interruptions. The long models are Coupled Ocean models, whereas the regional are 'slab ocean'. i.e. a fixed set of values for the ocean. Some years ago, an attempt was made to make a version of 'FAMOUS' models that would run multi-core. This didn't even get out of the alpha (in house) testing, apparently because of the coupling between the ocean and the atmosphere components. They quickly became unstable. So it's possible that the hadcm3 models are more finicky because of this ocean-atmosphere bit. If treated carefully, they (mostly) run OK. If treated as 'just another Windows program which can be interrupted whenever', then they have problems. Perhaps because of having a lot of files open, and being interrupted just as one has been updated, and its matching partner(s) haven't been yet? All just guesswork, so your ideas are as good as mine. :) Backups: Here ID: 45464 · Reply Quote

Chris Send message Joined: 9 Apr 12 Posts: 10 Credit: 2,700,404 RAC: 0	Message 45469 - Posted: 18 Jan 2013, 3:29:24 UTC Do you know which computers they run/ran on? With so many being multi-core, it seems odd there is this much trouble trying to run them in parallel, either on CPU or GPU. Wikipedia says the model is at least a decade old, but supercomputers had many cores by then. ID: 45469 · Reply Quote

old_user691968 Send message Joined: 30 Dec 12 Posts: 4 Credit: 7,776 RAC: 0	Message 45471 - Posted: 18 Jan 2013, 5:05:16 UTC - in response to Message 45462. Les Bayliss wrote: <quote> As for Joe's problem, there's an awful lot of BOINC suspensions. I get the feeling that the setting for "Suspend work if CPU usage is above" is still at the default of 25%, which means that BOINC, and the science apps, are constantly being stopped and started as Joe uses the computer. Other projects' work may not mind, but the Coupled Ocean models are too touchy for this. Sooner or later they usually fail. Especially if "Leave tasks in memory while suspended?" isn't set to Yes. </quote> I'd set the suspend work threshold to 0 (i.e. no threshold) after noticing once that BOINC seesawed between running and not running every ten seconds or so with me doing nothing at the computer. It didn't seem to affect my computer usage. But someone at the BOINC forums claimed that this may be one of my problems. I've posted a lot about my efforts to eradicate the "Task exited with zero status but no 'finished' file" errors that I got in BOINC's log file corresponding with the time this CPDN model seemed to give up the ghost (11:36:20?), here: boinc.berkeley.edu/dev/forum_thread.php?id=8134&postid=47366 I'd be grateful if an expert from here took a look at that thread to see anything I've missed. But I'd love to know if these errors are even the cause of the failure--or was it this line? "Atmos Hold Restart file rename failed on atmos_restart.hold" And was what I listed the complete stderr log or does it seem to be cut off in the middle? ID: 45471 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45472 - Posted: 18 Jan 2013, 5:07:20 UTC - in response to Message 45469. Supercomputers Backups: Here ID: 45472 · Reply Quote

old_user691968 Send message Joined: 30 Dec 12 Posts: 4 Credit: 7,776 RAC: 0	Message 45473 - Posted: 18 Jan 2013, 5:09:45 UTC A question about backups: is it enough to copy the \ProgramData\BOINC\projects\climateprediction.net project folder or is it necessary to copy the whole programdata directory as the tutorial says? If I did the latter wouldn't it mean turning back the clock on every task I ran (including other projects) back to the backup time? ID: 45473 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,105,570 RAC: 2,789	Message 45474 - Posted: 18 Jan 2013, 6:05:31 UTC - in response to Message 45473. No. In order to make a usable backup you need to copy everything in the ProgramData/BOINC folder. Yes, the clock will be reset on all projects, so it is good to make backups every few days. ID: 45474 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45475 - Posted: 18 Jan 2013, 7:47:21 UTC As Jim says - EVERYTHING. This is because there are things in the BOINC part, such as client_state.xml, which could also be called: BOINC's To Do list. And it MUST be done after both the manager and the client have been shut down. These days, backups are mostly a protection against power failures in their many forms, such as "the dog tripped over the power cable and pulled it out". As for clock problems, if you do a search of these boards you'll find very little mention of it. Except for this thread. :) As well as 'clock going backwards for all projects', there'll also be the problem of the computers getting new IDs for all projects. Unless you know the secret of altering the aforementioned client_state.xml, 'which ain't easy'. And speaking of 'clock going backwards for all projects', this will, depending on how bad it is, and the server settings for various projects, result in other projects aborting their WUs due to time problems. Backups: Here ID: 45475 · Reply Quote
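The "copy EVERYTHING" advice can be sketched as a small Python helper. This is a minimal illustration, not an official backup procedure: the function name `backup_boinc_data` and the backup folder naming are made up for the example, and the script does not check whether BOINC is still running, so the manager and the client must be shut down by hand first.

```python
import shutil
import time
from pathlib import Path

def backup_boinc_data(data_dir, backup_root):
    """Copy the whole BOINC data directory into a new timestamped folder.

    Run this only after both the BOINC Manager and the client have been
    shut down; the function itself does not verify that.
    """
    data_dir = Path(data_dir)
    # client_state.xml lives in the data directory; its absence suggests
    # the wrong folder was given.
    if not (data_dir / "client_state.xml").exists():
        raise FileNotFoundError(f"client_state.xml not found in {data_dir}")
    target = Path(backup_root) / time.strftime("BOINC-backup-%Y%m%d-%H%M%S")
    # copytree copies every file and subfolder, including all projects.
    shutil.copytree(data_dir, target)
    return target

# Example call (paths are placeholders, adjust to the actual installation):
# backup_boinc_data(r"C:\ProgramData\BOINC", r"D:\Backups")
```

Copying the complete directory, rather than only the climateprediction.net project folder, keeps client_state.xml consistent with the task files it describes.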

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45477 - Posted: 18 Jan 2013, 19:47:33 UTC Last modified: 18 Jan 2013, 19:48:14 UTC Stepping back a bit I caught sight of the forest. So, some other thoughts. "Task exited with zero status but no finished file" messages are "mostly harmless". People have been getting this on and off for years, and most of the time the model can restart OK. It's not something that's considered a worry. But there are many other things that can cause a model to fail, including the very nature of the models. (Mentioned elsewhere.) "No heartbeat ..." is another message that often shows up, but it just means that BOINC got too busy to detect something it needed, and then complained. Probably to itself. Although, as a result, it's possible that BOINC can get itself so muddled that it deletes WUs because it thinks it's "their fault" rather than its own. The computer clock being "out" (fast?) may be a worry when running other projects, but it would only be a problem here if it became "out" by several years. And even then I'm not sure what would happen, because most people's computers successfully resync regularly. I suspect that a lot of CPDN failures in recent times may be due to people cramming as much as they can into a computer to run at the same time, and just plain overloading/overwhelming the hardware. Backups: Here ID: 45477 · Reply Quote

tullio Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 45528 - Posted: 1 Feb 2013, 15:28:49 UTC This model (hadam3p_eu) failed after 34.8 s. Here is the reason: Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048 Leaving CPDN_Main::Monitor... Called boinc_finish Another one is however running. Tullio ID: 45528 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 45529 - Posted: 1 Feb 2013, 19:42:39 UTC - in response to Message 45528. I've had 2, and there's a few others as well. I reported it earlier, and the project people are looking into it. ID: 45529 · Reply Quote