Output file absent & Too many errors (may have bug)
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963 |
My latest one to crash with a REPLANCA error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File - Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing? Dave |
Send message Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963 |
I see the (presumably offending) tasks have gone from the server. Dave |
Send message Joined: 28 Mar 09 Posts: 126 Credit: 9,825,980 RAC: 0 |
My latest one to crash with a REPLANCA error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File - Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing? They usually die straight after the first trickle/zip for me. BOINC blog |
Send message Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963 |
The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over. Dave |
Send message Joined: 4 Sep 04 Posts: 1 Credit: 4,227,572 RAC: 0 |
All the recent ones I have had have failed, for a few days now. Would be nice to have one not fail around the _2.zip point. |
Send message Joined: 28 Mar 09 Posts: 126 Credit: 9,825,980 RAC: 0 |
The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over. Once they've been sent out there probably isn't a lot the project can do. While it is possible for the project to abort in-progress tasks, the version of BOINC they are running on CPDN server-side may not support it. GPUgrid used to do it, but then people complained about how their tasks got aborted after many hours of crunching. The tasks will fail anyway, so it's probably better just to let them die on their own. BOINC blog |
Send message Joined: 14 Apr 05 Posts: 31 Credit: 16,491,691 RAC: 0 |
Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the "replanca" variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now. Is there a problem with the supply of new tasks - possibly as a result of this issue? Brian |
Send message Joined: 16 Jan 10 Posts: 1081 Credit: 7,000,243 RAC: 4,190 |
[nedsram-cdl wrote:]Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the "replanca" variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now. The work units in the queue affected by the REPLANCA problem have been withdrawn and results that are running are failing quickly, so the supply of new units has declined to zero and the total number of running results has reduced somewhat as well. No doubt someone is working on a new set of work units with a correct set of ancillary files and the queue will fill accordingly when that is done. We'll know it's fixed when that happens! |
Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,074,094 RAC: 1,595 |
I just lost a hadam3p_eu WU after the first zip file, probably due to the REPLANCA error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch. Should I abort them before they start or let them run till they crash? Are they from the same bad batch? How do I tell? |
Send message Joined: 7 Aug 04 Posts: 2167 Credit: 64,524,430 RAC: 6,337 |
I just lost a hadam3p_eu WU after the first zip file, probably due to the REPLANCA error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch. It looks like the 2 you mention were downloaded July 24th. Thus, they are likely bad. One of the work units that the tasks belong to has already had a task crash with a REPLANCA error. I'd abort them. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
hello everyone, sorry but I have not had time to read this whole thread. I'm crunching the following 4 wu and they seem to be returning zip files ok. and I was wondering if it is ok to let them continue to run ? hadam3p_pnw_c6nd_1993_1_008091178 - - Sent - - 26 Jul 2012 14:03:18 UTC hadam3p_pnw_c75k_1968_1_008091170 - - Sent - - 26 Jul 2012 14:03:18 UTC hadcm3n_o44o_2100_40_008085978 - - - - - Sent - - 25 Jul 2012 20:48:43 UTC hadam3p_eu_alis_1998_1_008068421 - - - - Sent - - 19 Jul 2012 18:02:52 UTC my computer id 948812 my account userid=910 thanks , Byron |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
There are 3 separate problems, all from around the time that your models were sent. In order of when they happened to mine: some will fail at around 9-10 hours, between zips 1 & 2; some will fail at around 19-20 hours; and some will have files that "can't be found", and cause download failures. And there were also models that ran OK. The first 2 were due to REPLANCA errors: an auxiliary file not having the correct amount of data. The 3rd was an error with the path of a mirror server. All models were deleted from the download pool, but there are still re-sends, caused by people not starting work that they received back then. If you're running any of the failures you'll soon find out. Backups: Here |
Send message Joined: 7 Aug 04 Posts: 2167 Credit: 64,524,430 RAC: 6,337 |
and I was wondering if it is ok to let them continue to run ? Looks like all 4 of them should continue on okay. None look to be in the bad batches. You've already made enough progress on them that they've gotten past the typical failure points for EU and PNW models. |
Send message Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0 |
Thank you geophi and Les Bayliss for your replies. Yes, all 4 seem to be continuing ok with no problems, so I will let them continue to run to the end. Thanks, Byron |
Send message Joined: 21 Oct 06 Posts: 5 Credit: 2,162,915 RAC: 0 |
I just recently got a result error with the following stdout:
And also the following messages in the client:
Is it another kind of problem with the WUs? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The files are missing because the model crashed soon after starting, so none of the output data files got created. It's BOINC complaining about not being able to find them. Only the first couple of lines of the STDERR file are relevant. Backups: Here |
Send message Joined: 28 Mar 09 Posts: 126 Credit: 9,825,980 RAC: 0 |
Another one that crashed...
WU name: hadam3p_pnw_2yuc_1975_1_008145549_1 Created: 15 Aug 2012. I would link to it, but your Akismet anti-spam system thinks your own URLs are spam. The wuid is 8300673. BOINC blog |
Send message Joined: 19 Sep 04 Posts: 92 Credit: 1,936,173 RAC: 351 |
More of these REPLANCA errors in this UK Met Office Coupled Model Full Resolution Ocean WU created Friday http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8395212 <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( Called boinc_finish </stderr_txt> ]]> Professor Desty Nova Researching Karma the Hard Way |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Thanks, Professor. The REPLANCA errors have been reported to Andy and Jonathan. If one task in a WU crashes with REPLANCA, all the tasks in that WU will, and on all OSs. Cpdn news |
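
For anyone who wants to check a pile of failed results without reading each one, here is a rough Python sketch that scans a task's stderr text for the two messages quoted in this thread ("Model crashed: REPLANCA" and "too many model crashes"). The function name `summarize_stderr` and the result fields are made up for illustration; they are not part of BOINC or the CPDN application, and the sample text is only an excerpt of the stderr posted above.

```python
import re

# Patterns based only on the stderr text quoted in this thread.
CRASH_RE = re.compile(r"Model crashed:\s*REPLANCA", re.IGNORECASE)
GAVE_UP_RE = re.compile(r"too many model crashes", re.IGNORECASE)


def summarize_stderr(stderr_text: str) -> dict:
    """Count REPLANCA crash messages in a task's stderr (hypothetical helper)."""
    crashes = len(CRASH_RE.findall(stderr_text))
    return {
        "replanca_crashes": crashes,
        # True once the application has stopped retrying the model.
        "gave_up": bool(GAVE_UP_RE.search(stderr_text)),
        # Per the thread: if one task in a WU hits REPLANCA, its siblings will too.
        "likely_bad_workunit": crashes > 0,
    }


# Excerpt of the stderr quoted earlier in the thread (shortened to two crashes).
SAMPLE = (
    "Suspended CPDN Monitor - Suspend request from BOINC... "
    "Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH "
    "tmp/pipe_dummy 2048 "
    "Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH "
    "tmp/pipe_dummy 2048 "
    "Sorry, too many model crashes! :-( Called boinc_finish"
)

print(summarize_stderr(SAMPLE))
```

If `likely_bad_workunit` comes back true for one task, the advice given earlier in the thread applies: other unstarted tasks from the same work unit are probably worth aborting.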
©2024 climateprediction.net