Output file absent & Too many errors (may have bug)

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 44616 - Posted: 31 Jul 2012, 21:06:55 UTC - in response to Message 44614.  

My latest one to crash with a REPLANCA error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File > Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing?

Dave
ID: 44616
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 44617 - Posted: 31 Jul 2012, 21:08:40 UTC - in response to Message 44616.  

I see the (presumably offending) tasks have gone from the server.

Dave
ID: 44617
MarkJ
Joined: 28 Mar 09
Posts: 126
Credit: 9,825,980
RAC: 0
Message 44618 - Posted: 1 Aug 2012, 8:13:17 UTC - in response to Message 44616.  
Last modified: 1 Aug 2012, 8:15:37 UTC

[Dave Jackson wrote:]My latest one to crash with a REPLANCA error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File > Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing?

Dave


They usually die straight after the first trickle/zip for me
ID: 44618
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4345
Credit: 16,523,697
RAC: 5,963
Message 44619 - Posted: 1 Aug 2012, 16:35:56 UTC

The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over.

Dave
ID: 44619
Fred Bloggs
Joined: 4 Sep 04
Posts: 1
Credit: 4,227,572
RAC: 0
Message 44620 - Posted: 1 Aug 2012, 16:48:12 UTC - in response to Message 44619.  

All the recent ones I have had have failed, for a few days now.

Would be nice to have one not fail around the _2.zip point.
ID: 44620
MarkJ
Joined: 28 Mar 09
Posts: 126
Credit: 9,825,980
RAC: 0
Message 44621 - Posted: 3 Aug 2012, 11:07:07 UTC - in response to Message 44619.  

[Dave Jackson wrote:]The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over.

Dave


Once they've been sent out there probably isn't a lot the project can do. While it is possible for a project to abort in-progress tasks, the version of BOINC that CPDN is running server-side may not support it. GPUgrid used to do it, but then people complained about their tasks being aborted after many hours of crunching. The tasks will fail anyway, so it's probably better just to let them die on their own.
ID: 44621
nedsram-cdl
Joined: 14 Apr 05
Posts: 31
Credit: 16,491,691
RAC: 0
Message 44624 - Posted: 4 Aug 2012, 10:03:06 UTC

Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the "replanca" variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now.

Is there a problem with the supply of new tasks - possibly as a result of this issue?
Brian
ID: 44624
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 7,000,243
RAC: 4,190
Message 44625 - Posted: 4 Aug 2012, 23:12:10 UTC - in response to Message 44624.  

[nedsram-cdl wrote:]Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the "replanca" variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now.

Is there a problem with the supply of new tasks - possibly as a result of this issue?

The work units in the queue affected by the REPLANCA problem have been withdrawn and results that are running are failing quickly, so the supply of new units has declined to zero and the total number of running results has reduced somewhat as well. No doubt someone is working on a new set of work units with a correct set of ancillary files and the queue will fill accordingly when that is done. We'll know it's fixed when that happens!
ID: 44625
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,074,094
RAC: 1,595
Message 44626 - Posted: 5 Aug 2012, 15:51:03 UTC
Last modified: 5 Aug 2012, 15:51:53 UTC

I just lost a hadam3p_eu WU after the first zip file, probably due to the REPLANCA error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch.

Should I abort them before they start or let them run until they crash? Are they from the same bad batch? How do I tell?
ID: 44626
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,524,430
RAC: 6,337
Message 44627 - Posted: 5 Aug 2012, 18:57:22 UTC - in response to Message 44626.  

[JIM wrote:]I just lost a hadam3p_eu WU after the first zip file, probably due to the REPLANCA error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch.

Should I abort them before they start or let them run until they crash? Are they from the same bad batch? How do I tell?


It looks like the 2 you mention were downloaded July 24th, so they are likely bad. One of the work units that the tasks belong to has already had a task crash with a REPLANCA error. I'd abort them.
ID: 44627
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 44628 - Posted: 6 Aug 2012, 3:18:02 UTC

hello everyone,

sorry but I have not had time to read this whole thread.

I'm crunching the following 4 wu and they seem to be returning zip files ok.

and I was wondering if it is ok to let them continue to run ?

hadam3p_pnw_c6nd_1993_1_008091178 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadam3p_pnw_c75k_1968_1_008091170 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadcm3n_o44o_2100_40_008085978 - - - - - Sent - - 25 Jul 2012 20:48:43 UTC
hadam3p_eu_alis_1998_1_008068421 - - - - Sent - - 19 Jul 2012 18:02:52 UTC

my computer id 948812
my account userid=910

thanks ,
Byron
ID: 44628
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44629 - Posted: 6 Aug 2012, 5:43:33 UTC - in response to Message 44628.  

There are 3 separate problems, all from around the time that your models were sent.
In order of when they happened to mine:

Some will fail at around 9-10 hours, between zips 1 & 2
Some will fail at around 19-20 hours
Some will have files that "can't be found", and cause download failures
And there were also models that ran OK.

The first 2 were due to REPLANCA errors: an auxiliary file not having the correct amount of data. The 3rd was an error with the path of a mirror server.

All models were deleted from the download pool, but there are still re-sends, caused by people not starting work that they received back then.

If you're running any of the failures you'll soon find out.


ID: 44629
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,524,430
RAC: 6,337
Message 44630 - Posted: 6 Aug 2012, 14:33:19 UTC - in response to Message 44628.  

[Byron Leigh Hatch wrote:]and I was wondering if it is ok to let them continue to run ?

hadam3p_pnw_c6nd_1993_1_008091178 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadam3p_pnw_c75k_1968_1_008091170 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadcm3n_o44o_2100_40_008085978 - - - - - Sent - - 25 Jul 2012 20:48:43 UTC
hadam3p_eu_alis_1998_1_008068421 - - - - Sent - - 19 Jul 2012 18:02:52 UTC

my computer id 948812
my account userid=910


Looks like all 4 of them should continue on okay. None look to be in the bad batches. You've already made enough progress on them that they've gotten past the typical failure points for EU and PNW models.
ID: 44630
Byron Leigh Hatch @ team Carl ...
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 44633 - Posted: 7 Aug 2012, 11:49:53 UTC - in response to Message 44630.  

Thank you geophi and Les Bayliss for your reply

Yes all 4 seem to be continuing ok with no problems.

So I will let them continue to run to the end.

thanks,
Byron
ID: 44633
AlphaLaser
Joined: 21 Oct 06
Posts: 5
Credit: 2,162,915
RAC: 0
Message 44690 - Posted: 13 Aug 2012, 3:38:01 UTC

I just recently got a result error with the following stderr output:


<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>

Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_13.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>


And also the following messages in the client:



8/12/2012 9:26:31 PM climateprediction.net Computation for task hadam3p_eu_69wa_2000_1_008138105_0 finished
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_1.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_2.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_3.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_4.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_5.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_6.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_7.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_8.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_9.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_10.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_11.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_12.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_13.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent


Is it another kind of problem with the WUs?
ID: 44690
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44691 - Posted: 13 Aug 2012, 5:04:27 UTC - in response to Message 44690.  

The files are missing because the model crashed soon after starting, so none of the output data files were created. It's BOINC complaining about not being able to find them.
Only the first couple of lines of the STDERR file are relevant.
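
To make that concrete, here is a minimal Python sketch that pulls only the "Model crashed:" lines out of a task's stderr text and ignores the file_xfer_error entries that follow. The function name crash_reasons is made up for illustration, and the assumption that the real cause always sits on a line starting with "Model crashed:" comes only from the logs posted in this thread.

```python
import re

# Matches lines such as "Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE ..."
CRASH_LINE = re.compile(r"^\s*Model crashed:\s*(.+?)\s*$")


def crash_reasons(stderr_text):
    """Return the distinct 'Model crashed:' messages, in order of first appearance.

    Hypothetical helper: everything else in the text, including the
    <file_xfer_error> blocks, is ignored.
    """
    reasons = []
    for line in stderr_text.splitlines():
        match = CRASH_LINE.match(line)
        if match and match.group(1) not in reasons:
            reasons.append(match.group(1))
    return reasons


# Shortened copy of the stderr posted earlier in this thread.
sample = """
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
"""
print(crash_reasons(sample))
```

Repeated crash lines (as in a task that retries several times before giving up) collapse to a single entry, so the result is just the list of distinct causes.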


ID: 44691
MarkJ
Joined: 28 Mar 09
Posts: 126
Credit: 9,825,980
RAC: 0
Message 44747 - Posted: 20 Aug 2012, 11:44:54 UTC
Last modified: 20 Aug 2012, 11:45:58 UTC

Another one that crashed...


Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Regional yearly means requires 12 input files got 0


Wu name: hadam3p_pnw_2yuc_1975_1_008145549_1
Created: 15 Aug 2012

I would link to it but your Akismet anti-spam system thinks your own URLs are spam. The wuid is 8300673
ID: 44747
Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 1,936,173
RAC: 351
Message 45193 - Posted: 28 Oct 2012, 23:03:51 UTC

More of these REPLANCA errors in this UK Met Office Coupled Model Full Resolution Ocean WU created Friday http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8395212

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
Called boinc_finish

</stderr_txt>
]]>




Professor Desty Nova
Researching Karma the Hard Way
ID: 45193
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45201 - Posted: 31 Oct 2012, 2:52:27 UTC

Thanks, Professor. The REPLANCA errors have been reported to Andy and Jonathan. If one task in a WU crashes with REPLANCA, all the tasks in that WU will, and on all OSs.
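
That rule can be turned into a quick check of which queued tasks share a work unit with one that has already crashed. A minimal sketch in Python, assuming a task name is the work unit name plus a trailing _N replica number (as the names quoted in this thread suggest); the function names, the crashed sibling task and the _0 suffix on the second queued task are invented for illustration.

```python
def workunit_of(task_name):
    """Strip the trailing _N replica number to get the work unit name (assumed naming scheme)."""
    base, sep, replica = task_name.rpartition("_")
    if not sep or not base or not replica.isdigit():
        raise ValueError(f"unexpected task name: {task_name!r}")
    return base


def tasks_sharing_bad_workunit(queued_tasks, replanca_crashes):
    """Return the queued tasks whose work unit already had a REPLANCA crash."""
    bad_workunits = {workunit_of(name) for name in replanca_crashes}
    return [name for name in queued_tasks if workunit_of(name) in bad_workunits]


queued = [
    "hadam3p_eu_cum6_2000_1_008085302_1",
    "hadam3p_eu_alis_1998_1_008068421_0",  # replica suffix is hypothetical
]
crashed = ["hadam3p_eu_cum6_2000_1_008085302_0"]  # hypothetical sibling task
print(tasks_sharing_bad_workunit(queued, crashed))
```

Anything the check returns is a candidate for aborting before it starts; tasks from other work units are left alone.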
ID: 45201

©2024 climateprediction.net