Output file absent & Too many errors (may have bug)

Author	Message
Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963	Message 44616 - Posted: 31 Jul 2012, 21:06:55 UTC - in response to Message 44614. My latest one to crash with a replanca error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File > Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing? Dave ID: 44616

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963	Message 44617 - Posted: 31 Jul 2012, 21:08:40 UTC - in response to Message 44616. I see the (presumably offending) tasks have gone from the server. Dave ID: 44617

MarkJ Joined: 28 Mar 09 Posts: 126 Credit: 9,825,980 RAC: 0	Message 44618 - Posted: 1 Aug 2012, 8:13:17 UTC - in response to Message 44616. Last modified: 1 Aug 2012, 8:15:37 UTC [Dave Jackson wrote:] "My latest one to crash with a replanca error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File > Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing? Dave" They usually die straight after the first trickle/zip for me. BOINC blog ID: 44618

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4345 Credit: 16,523,697 RAC: 5,963	Message 44619 - Posted: 1 Aug 2012, 16:35:56 UTC The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over. Dave ID: 44619

Fred Bloggs Joined: 4 Sep 04 Posts: 1 Credit: 4,227,572 RAC: 0	Message 44620 - Posted: 1 Aug 2012, 16:48:12 UTC - in response to Message 44619. All the recent ones I have had have failed, for a few days now. Would be nice to have one not fail around the _2.zip point. ID: 44620

MarkJ Joined: 28 Mar 09 Posts: 126 Credit: 9,825,980 RAC: 0	Message 44621 - Posted: 3 Aug 2012, 11:07:07 UTC - in response to Message 44619. [Dave Jackson wrote:] "The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over. Dave" Once they've been sent out there probably isn't a lot the project can do. While it is possible for a project to abort in-progress tasks, the version of BOINC that CPDN runs server-side may not support it. GPUgrid used to do it, but then people complained that their tasks were aborted after many hours of crunching. The tasks will fail anyway, so it's probably better just to let them die on their own. BOINC blog ID: 44621

nedsram-cdl Joined: 14 Apr 05 Posts: 31 Credit: 16,491,691 RAC: 0	Message 44624 - Posted: 4 Aug 2012, 10:03:06 UTC Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the "replanca" variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now. Is there a problem with the supply of new tasks - possibly as a result of this issue? Brian ID: 44624

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 7,000,243 RAC: 4,190	Message 44625 - Posted: 4 Aug 2012, 23:12:10 UTC - in response to Message 44624. [nedsram-cdl wrote:] "Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the 'replanca' variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now. Is there a problem with the supply of new tasks - possibly as a result of this issue?" The work units in the queue affected by the REPLANCA problem have been withdrawn and results that are running are failing quickly, so the supply of new units has declined to zero and the total number of running results has reduced somewhat as well. No doubt someone is working on a new set of work units with a correct set of ancillary files and the queue will fill accordingly when that is done. We'll know it's fixed when that happens! ID: 44625

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,074,094 RAC: 1,595	Message 44626 - Posted: 5 Aug 2012, 15:51:03 UTC Last modified: 5 Aug 2012, 15:51:53 UTC I just lost a hadam3p_eu WU after the first zip file, probably due to the replanca error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch. Should I abort them before they start or let them run till they crash? Are they from the same bad batch? How do I tell? ID: 44626

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2167 Credit: 64,524,430 RAC: 6,337	Message 44627 - Posted: 5 Aug 2012, 18:57:22 UTC - in response to Message 44626. [JIM wrote:] "I just lost a hadam3p_eu WU after the first zip file, probably due to the replanca error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch. Should I abort them before they start or let them run till they crash? Are they from the same bad batch? How do I tell?" It looks like the 2 you mention were downloaded July 24th. Thus, they are likely bad. One of the work units that the tasks belong to has already had a task crash with a REPLANCA error. I'd abort them. ID: 44627

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 44628 - Posted: 6 Aug 2012, 3:18:02 UTC Hello everyone, sorry but I have not had time to read this whole thread. I'm crunching the following 4 WUs and they seem to be returning zip files OK, and I was wondering if it is OK to let them continue to run? hadam3p_pnw_c6nd_1993_1_008091178 (sent 26 Jul 2012 14:03:18 UTC); hadam3p_pnw_c75k_1968_1_008091170 (sent 26 Jul 2012 14:03:18 UTC); hadcm3n_o44o_2100_40_008085978 (sent 25 Jul 2012 20:48:43 UTC); hadam3p_eu_alis_1998_1_008068421 (sent 19 Jul 2012 18:02:52 UTC). My computer id is 948812 and my account userid=910. Thanks, Byron ID: 44628

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 44629 - Posted: 6 Aug 2012, 5:43:33 UTC - in response to Message 44628. There are 3 separate problems, all from around the time that your models were sent. In the order in which they happened to mine: (1) some will fail at around 9-10 hours, between zips 1 & 2; (2) some will fail at around 19-20 hours; (3) some will have files that "can't be found", and cause download failures. And there were also models that ran OK. The first 2 were due to REPLANCA errors: an auxiliary file not containing the correct amount of data. The 3rd was an error with the path of a mirror server. All models were deleted from the download pool, but there are still re-sends, caused by people not starting work that they received back then. If you're running any of the failures you'll soon find out. Backups: Here ID: 44629

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2167 Credit: 64,524,430 RAC: 6,337	Message 44630 - Posted: 6 Aug 2012, 14:33:19 UTC - in response to Message 44628. [Byron Leigh Hatch wrote:] "and I was wondering if it is OK to let them continue to run? hadam3p_pnw_c6nd_1993_1_008091178 (sent 26 Jul 2012 14:03:18 UTC); hadam3p_pnw_c75k_1968_1_008091170 (sent 26 Jul 2012 14:03:18 UTC); hadcm3n_o44o_2100_40_008085978 (sent 25 Jul 2012 20:48:43 UTC); hadam3p_eu_alis_1998_1_008068421 (sent 19 Jul 2012 18:02:52 UTC). My computer id is 948812 and my account userid=910." Looks like all 4 of them should continue on okay. None look to be in the bad batches. You've already made enough progress on them that they've gotten past the typical failure points for EU and PNW models. ID: 44630

Byron Leigh Hatch @ team Carl ... Joined: 17 Aug 04 Posts: 289 Credit: 44,103,664 RAC: 0	Message 44633 - Posted: 7 Aug 2012, 11:49:53 UTC - in response to Message 44630. Thank you geophi and Les Bayliss for your replies. Yes, all 4 seem to be continuing OK with no problems, so I will let them continue to run to the end. Thanks, Byron ID: 44633

AlphaLaser Joined: 21 Oct 06 Posts: 5 Credit: 2,162,915 RAC: 0	Message 44690 - Posted: 13 Aug 2012, 3:38:01 UTC I just recently got a result error with the following stdout:
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_1.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_2.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_3.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_4.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_5.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_6.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_7.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_8.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_9.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_10.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_11.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_12.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
<file_xfer_error> <file_name>hadam3p_eu_69wa_2000_1_008138105_0_13.zip</file_name> <error_code>-161</error_code> </file_xfer_error>
</message>
]]>
And also the following messages in the client:
8/12/2012 9:26:31 PM climateprediction.net Computation for task hadam3p_eu_69wa_2000_1_008138105_0 finished
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_1.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_2.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_3.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_4.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_5.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_6.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_7.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_8.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_9.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_10.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_11.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_12.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_13.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
Is it another kind of problem with the WUs? ID: 44690

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 44691 - Posted: 13 Aug 2012, 5:04:27 UTC - in response to Message 44690. The files are missing because the model crashed soon after starting, so none of the output data files got created. It's BOINC complaining about not being able to find them. Only the first couple of lines of the STDERR file are relevant. Backups: Here ID: 44691
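For anyone who wants to skim a long result page quickly, here is a minimal Python sketch of the point above: pull out only the "Model crashed:" lines and ignore the list of absent zip files. The function names and the regular expression are illustrative assumptions based on the messages quoted in this thread, not part of BOINC or the CPDN application.

```python
import re

# Assumption: crash lines look like the ones quoted in this thread, e.g.
# "Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048"
CRASH_RE = re.compile(
    r"Model crashed:\s*(?P<routine>[A-Z0-9_]+):\s*(?P<detail>.*?)\s+tmp/\S*pipe_dummy"
)

def crash_reasons(stderr_text):
    """Return (routine, detail) pairs for every 'Model crashed' entry."""
    return [(m.group("routine"), m.group("detail"))
            for m in CRASH_RE.finditer(stderr_text)]

def absent_outputs(stderr_text):
    """Return the zip names listed in <file_name> tags (the secondary noise)."""
    return re.findall(r"<file_name>(\S+?\.zip)</file_name>", stderr_text)

if __name__ == "__main__":
    sample = (
        "Model crashed: INITTIME: Atmosphere basis time mismatch "
        "tmp/xaakm.pipe_dummy 2048 Leaving CPDN_Main::Monitor... "
        "Called boinc_finish <file_xfer_error> "
        "<file_name>hadam3p_eu_69wa_2000_1_008138105_0_1.zip</file_name> "
        "<error_code>-161</error_code> </file_xfer_error>"
    )
    print(crash_reasons(sample))
    print(absent_outputs(sample))
```

The crash reason is what is worth reporting; the file transfer errors are just a consequence of the model never producing its output.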

MarkJ Joined: 28 Mar 09 Posts: 126 Credit: 9,825,980 RAC: 0	Message 44747 - Posted: 20 Aug 2012, 11:44:54 UTC Last modified: 20 Aug 2012, 11:45:58 UTC Another one that crashed... Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048 Leaving CPDN_Main::Monitor... Regional yearly means requires 12 input files got 0 Wu name: hadam3p_pnw_2yuc_1975_1_008145549_1 Created: 15 Aug 2012 I would link to it but your Akismet anti-spam system thinks your own URLs are spam. The wuid is 8300673 BOINC blog ID: 44747

Professor Desty Nova Joined: 19 Sep 04 Posts: 92 Credit: 1,936,173 RAC: 351	Message 45193 - Posted: 28 Oct 2012, 23:03:51 UTC More of these REPLANCA errors in this UK Met Office Coupled Model Full Resolution Ocean WU created Friday http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8395212 <core_client_version>7.0.28</core_client_version> <![CDATA[ <message> The device does not recognize the command. (0x16) - exit code 22 (0x16) </message> <stderr_txt> Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048 Sorry, too many model crashes! :-( Called boinc_finish </stderr_txt> ]]> Professor Desty Nova Researching Karma the Hard Way ID: 45193

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 45201 - Posted: 31 Oct 2012, 2:52:27 UTC Thanks, Professor. The REPLANCA errors have been reported to Andy and Jonathan. If one task in a WU crashes with REPLANCA, all the tasks in that WU will, and on all OSs. Cpdn news ID: 45201
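Since a REPLANCA crash in one task means every task in the same work unit will fail, a small script can flag queued tasks that share a work unit with a known crash. This is only a sketch: it assumes the usual BOINC naming, where a task name is the work unit name plus "_" and a replica number (for example hadam3p_eu_cum6_2000_1_008085302_1), and the function names are made up for illustration.

```python
def workunit_of(task_name):
    """Strip the trailing replica index from a task name.

    Assumption: task name = work unit name + "_" + replica number.
    Pass task names only; a bare work unit name would be truncated.
    """
    return task_name.rsplit("_", 1)[0]

def doomed_tasks(queued, crashed_with_replanca):
    """Return queued tasks whose work unit already had a REPLANCA crash."""
    bad_wus = {workunit_of(t) for t in crashed_with_replanca}
    return [t for t in queued if workunit_of(t) in bad_wus]

if __name__ == "__main__":
    queued = [
        "hadam3p_eu_cum6_2000_1_008085302_1",
        "hadam3p_pnw_2yuc_1975_1_008145549_1",
    ]
    # Hypothetical sibling task that is known to have crashed with REPLANCA.
    crashed = ["hadam3p_eu_cum6_2000_1_008085302_0"]
    for task in doomed_tasks(queued, crashed):
        print("likely to fail, consider aborting:", task)
```

Checking the work unit page on the project site is still the authoritative way to see whether a sibling task has already failed.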