Output file absent & Too many errors (may have bug)


Message boards : Number crunching : Output file absent & Too many errors (may have bug)

skgiven
Joined: 5 Jun 06
Posts: 27
Credit: 2,541,290
RAC: 556
Message 44562 - Posted: 22 Jul 2012, 10:49:56 UTC
Last modified: 22 Jul 2012, 10:52:57 UTC

Output file absent:

22/07/2012 10:38:50 | climateprediction.net | Computation for task hadam3p_eu_634j_2009_1_008071304_2 finished
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_2.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_3.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_4.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_5.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_6.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_7.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_8.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_9.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_10.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_11.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_12.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
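
A long run of "absent" lines like the ones above can be condensed with a short script. This is only a minimal Python sketch, not a BOINC tool: the function name `missing_outputs` and the regular expression are assumptions based solely on the line format visible in this log.

```python
import re
from collections import defaultdict

# Pattern assumed from the log lines above:
# "Output file <name>.zip for task <task> absent"
ABSENT_RE = re.compile(r"Output file (\S+)\.zip for task (\S+) absent")


def missing_outputs(log_text):
    """Map each task name to the sorted zip indices reported absent."""
    missing = defaultdict(list)
    for line in log_text.splitlines():
        match = ABSENT_RE.search(line)
        if not match:
            continue  # e.g. the "Computation for task ... finished" line
        file_stem, task = match.groups()
        # The zip index is whatever follows "<task>_" in the file name.
        if file_stem.startswith(task + "_"):
            suffix = file_stem[len(task) + 1:]
            if suffix.isdigit():
                missing[task].append(int(suffix))
    return {task: sorted(indices) for task, indices in missing.items()}


sample = """\
22/07/2012 10:38:50 | climateprediction.net | Computation for task hadam3p_eu_634j_2009_1_008071304_2 finished
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_2.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_12.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
"""

print(missing_outputs(sample))
```

Run against the full log, this would list indices 2 to 12 for the one task, which makes it easier to see that zip 1 is not among the missing files.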

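Task 14973021 | WU 8226418 | Computer 1212547 | Sent 22 Jul 2012 0:47:15 UTC | Reported 22 Jul 2012 10:30:11 UTC | Error while computing | Run time 26,180.15 s | CPU time 25,922.02 s | Claimed credit 0.00 | Granted credit --- | UK Met Office HADAM3P European Region v6.09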

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>

Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

-161 is a File Not Found error.
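
For anyone comparing several failed tasks, the upload failures can be pulled out of the `<message>` text with a few lines of Python. This is a rough sketch under stated assumptions: the helper name `upload_failures` is made up, the code table contains only the -161 meaning mentioned above, and a regular expression is used because the fragment is not a complete XML document.

```python
import re

# Only the code discussed in this thread is named; anything else is "unknown".
KNOWN_CODES = {-161: "file not found"}

# Matches one <file_xfer_error> entry as it appears in the task's message text.
XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>([^<]+)</file_name>\s*"
    r"<error_code>(-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def upload_failures(message_text):
    """Return (file name, error code, meaning) for each failed transfer."""
    results = []
    for name, code_text in XFER_RE.findall(message_text):
        code = int(code_text)
        results.append((name.strip(), code, KNOWN_CODES.get(code, "unknown")))
    return results


sample = """upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
"""

for entry in upload_failures(sample):
    print(entry)
```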

My System.

My Task

The WorkUnit

Notes. The Ethernet to Internet connection was disconnected at the time. Also running POEM (GPU), RNA world and yoyo tasks. Only 4 CPU threads used (due to POEM requirements/setup). Write to disk @900sec. No other system or Boinc issues.
____________

Eirik Redd
Joined: 31 Aug 04
Posts: 334
Credit: 52,489,688
RAC: 14,640
Message 44563 - Posted: 22 Jul 2012, 11:26:11 UTC - in response to Message 44562.

Yes, I've seen maybe a half-dozen of these in the last few weeks. They are malformed tasks that have been automatically re-issued but won't ever work because of the
"REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH"
error. Just let them die and don't worry about it.
____________

skgiven
Joined: 5 Jun 06
Posts: 27
Credit: 2,541,290
RAC: 556
Message 44564 - Posted: 22 Jul 2012, 13:03:15 UTC - in response to Message 44563.
Last modified: 22 Jul 2012, 13:04:01 UTC

Thanks for the confirmation. This sort of issue occurs at other projects too, usually when the researchers make a mistake when building the tasks, but it has also been caused by deprecated clients for auto-generated tasks.

Might it be possible or worthwhile to add an early trickle point, or a file-check routine, to reduce the loss in such situations, so that tasks would fail early rather than after, say, 10 h?
____________

hagar
Joined: 6 Aug 04
Posts: 88
Credit: 12,973,134
RAC: 4,525
Message 44565 - Posted: 22 Jul 2012, 13:33:56 UTC

A quick check shows six out of 330 AM3P that I have run this year on three PCs (two XP, one Linux) have zonked out with an error, including this 'output file absent'. That's less than 2% attrition rate, which is very low compared to the much higher attrition rates on the longer models.

(I lost an AM3p and a CM3 yesterday to a very short power brownout that caused one PC and the internet router to reboot. The other PC, two laptops, monitors and a printer didn't blink.)

For an ensemble methodology, 2% attrition rate is probably not worth the effort of delving further into the reasons for the error. I simply accept there will be an attrition rate.
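
The arithmetic behind the "less than 2%" figure is easy to check. A small Python sketch, with a made-up helper name `attrition_rate`, using the six failures out of 330 tasks quoted above:

```python
def attrition_rate(failed, total):
    """Fraction of tasks that ended in an error."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


rate = attrition_rate(6, 330)
print(f"{rate:.2%}")  # prints 1.82%
```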
____________

Dave Jackson
Joined: 15 May 09
Posts: 1357
Credit: 1,900,281
RAC: 3,506
Message 44566 - Posted: 22 Jul 2012, 17:09:41 UTC - in response to Message 44563.

Thanks, saves me searching for answers, I had two pnw tasks go like this for me yesterday, though there was a power cut involved as well so I can't be 100% sure of the cause.

Any typing errors are due to not being used to the tiny netbook keyboard. - Atom slowly making its way through two EU units. I will have to get the extra GB of memory to see if it makes any difference.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2359
Credit: 9,864,224
RAC: 4,506
Message 44567 - Posted: 22 Jul 2012, 17:55:28 UTC

This REPLANCA thing is an error in the model. It happened a few months ago so we need to check whether there's a new batch of models with the same problem. It looks as if the headers on ancillary files don't match:

http://cms.ncas.ac.uk/trac/UMHelpdesk/ticket/399

It's a real nuisance that the web pages for these regional models take ages to open up so it's not easy to see what's happening with different WUs.
____________
Cpdn news

skgiven
Joined: 5 Jun 06
Posts: 27
Credit: 2,541,290
RAC: 556
Message 44568 - Posted: 22 Jul 2012, 20:06:53 UTC - in response to Message 44567.
Last modified: 22 Jul 2012, 20:09:51 UTC

From WU 8226400 to 8226430 there are 15 failed tasks; several have failed more than once, and none has reported successfully.
All are UK Met Office HADAM3P European Region and all were created at around the same time (20 Jul 2012 5:50:00 to 5:59:00 UTC).

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226418
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226422
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226419
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226418
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226417
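
To go through the whole range rather than a handful of links, the page addresses can be generated instead of typed. A minimal Python sketch, assuming only the URL pattern visible in the links above; the helper name `workunit_urls` is hypothetical, and nothing is fetched from the server here.

```python
# URL pattern copied from the work unit links posted in this thread.
BASE_URL = "http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid={wuid}"


def workunit_urls(first, last):
    """Build the work unit page URL for every ID from first to last inclusive."""
    return [BASE_URL.format(wuid=wuid) for wuid in range(first, last + 1)]


urls = workunit_urls(8226400, 8226430)
print(len(urls))
print(urls[0])
print(urls[-1])
```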
____________

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2359
Credit: 9,864,224
RAC: 4,506
Message 44569 - Posted: 24 Jul 2012, 0:26:34 UTC

I can't get the task pages to open for me at all, even after hours. I can only look at the WU and computer pages. So I can't see whether all the computers are crashing the models with the same error. (I'm discounting computers that can't run any climate models at all and need to have their daily quota minussed until their owners put things right.)

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226420

Now why can this computer with Windows complete one of this batch of models?



____________
Cpdn news

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2359
Credit: 9,864,224
RAC: 4,506
Message 44570 - Posted: 24 Jul 2012, 0:51:17 UTC

I've found some Windows machines with the error and two that have now completed their model. There's a single Mac that seems to be crunching one OK. All the other Macs I've found are crashing everything with the usual problem.


____________
Cpdn news

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2359
Credit: 9,864,224
RAC: 4,506
Message 44571 - Posted: 24 Jul 2012, 1:25:37 UTC
Last modified: 24 Jul 2012, 1:25:59 UTC

I wonder whether something else unrelated (?) to the REPLANCA error is going on with the EU models. Look at Paolo's computer and its tasks.

It can process Hadcm, Hadam PNW and Hadam SA nicely. But it crashes every Hadam EU in less than a minute as if the computer was misconfigured. These can't all be REPLANCA crashes.
____________
Cpdn news

Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 100
Message 44572 - Posted: 24 Jul 2012, 13:05:38 UTC

Ah yes, Replanca. I gambled away a small fortune at its beach-side casinos; where I wined and dined an Italian woman whose name I cannot remember....

Where was I? Oh yes, task 14903295, a PNW, just turned up this error at around 98% completion. I have another PNW finishing up shortly; we'll see what happens.

Belfry
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 100
Message 44573 - Posted: 24 Jul 2012, 15:15:56 UTC
Last modified: 24 Jul 2012, 15:20:02 UTC

No, there is no Replanca ..., nor Italian women whose names I cannot remember for that matter. Just sounded like an exotic place name, like Pollenca or Menorca ;)

Edit: my other PNW finished fine.

skgiven
Joined: 5 Jun 06
Posts: 27
Credit: 2,541,290
RAC: 556
Message 44574 - Posted: 24 Jul 2012, 22:20:08 UTC - in response to Message 44571.
Last modified: 24 Jul 2012, 23:10:24 UTC

Paolo's Hadam EU tasks on that computer are all crashing with an exit status of -2:

Outcome Client error
Client state Compute error
Exit status -2 (0xfffffffffffffffe)
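
The hex value in brackets is just -2 written as an unsigned 64-bit number (two's complement), not a second error code. A short Python sketch of the conversion in both directions; the function names are made up, and the 64-bit width is inferred from the sixteen hex digits shown.

```python
def as_unsigned_hex(value, bits=64):
    """Show a signed exit status as the unsigned hex form of the same bits."""
    mask = (1 << bits) - 1
    return hex(value & mask)


def from_unsigned(value, bits=64):
    """Convert an unsigned two's-complement value back to a signed integer."""
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


print(as_unsigned_hex(-2))
print(from_unsigned(0xFFFFFFFFFFFFFFFE))
print(as_unsigned_hex(0))
```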

I think this is an issue with the task or app, and has nothing to do with Windows, the BOINC manager or client, or other apps.


Some of Paolo's other computers are failing due to the REPLANCA issue with Exit status 0, error_code -161 (file_xfer_error):

Exit status 0 (0x0)
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH

Some of these don't seem to run (Error while downloading) but others do run (file_xfer_error):

Task 14947580 | WU 8213025 | Sent 19 Jul 2012 17:57:28 UTC | Reported 21 Jul 2012 3:16:32 UTC | Error while computing | Run time 102,580.61 s | CPU time 100,825.80 s | Claimed credit 399.11 | Granted credit 399.11 | UK Met Office HADAM3P European Region v6.09

In this case, could the trickle result in a failure (file_xfer_error) that in turn causes the task to be killed, and could all this be linked to the server's availability/responsiveness (pages not loading)?

- More likely one of the ranges is out!
____________

[boinc.at] Nowi
Joined: 16 Jul 05
Posts: 32
Credit: 3,881,905
RAC: 1,160
Message 44577 - Posted: 25 Jul 2012, 11:49:33 UTC

I have the Replanca problem, too. Four models in a row have a computation error after about 13000 s of computation time.
____________

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 6231
Credit: 14,607,204
RAC: 543
Message 44578 - Posted: 26 Jul 2012, 0:28:29 UTC

Lots of people seem to be getting this. I'm up to my 4th or 5th failure. :(

Information that would be useful:
The actual name of the failed model.
Roughly when it failed.
Whether you noticed that a mysterious "zip 13" file was created.

e.g. For one of mine:
hadam3p_eu_8aow_2005_1_008058020_0
REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH

This was between zips 2 and 3, at 25 hours 47 minutes 39 seconds, and a zip 13 was created.

____________
Backups: Here

[boinc.at] Nowi
Joined: 16 Jul 05
Posts: 32
Credit: 3,881,905
RAC: 1,160
Message 44580 - Posted: 26 Jul 2012, 8:38:20 UTC - in response to Message 44578.

Here are my failed WUs. All failed with Replanca in the stderr.out.

hadam3p_eu_cqxv_2000_1_008083091_2 zip1, zip13 uploaded
hadam3p_eu_ctbo_2009_1_008084522_1 zip1, zip13 uploaded
hadam3p_eu_ctx1_2008_1_008084858_0 zip1, zip13 uploaded
hadam3p_eu_a74l_1990_1_008067608_1 crashed after 8.79 s no zips uploaded
hadam3p_eu_ct79_2004_1_008084440_0 zip1, zip13 uploaded
hadam3p_eu_csgf_2006_1_008083996_0 zip1, zip13 uploaded
hadam3p_eu_crlu_2005_1_008083482_0 zip1, zip13 uploaded
hadam3p_eu_cr5j_2001_1_008083225_0 zip1, zip13 uploaded

I hope that will help.


____________

Dave Roberts
Joined: 15 Jan 11
Posts: 107
Credit: 2,120,159
RAC: 312
Message 44581 - Posted: 26 Jul 2012, 9:05:53 UTC

I don't know if this info. is useful for comparison/investigative purposes - but just in case...
One of my computers (ID: 1142892 ) has been running tasks of this model successfully for a while, the latest (Task ID 8210373) successfully completing yesterday. The previous run was Task ID 14734712, which completed successfully on 31st May.

Nigel Garvey
Joined: 5 May 10
Posts: 46
Credit: 762,215
RAC: 0
Message 44582 - Posted: 26 Jul 2012, 9:09:45 UTC - in response to Message 44578.

hadam3p_eu_cqgw_2005_1_008082804_0 and hadam3p_eu_cqgu_2003_1_008082803_0. Downloaded at 09:14 BST yesterday and run in parallel from then until they apparently "completed" within seconds of each other at getting on for 01:00 this morning. Files _2 to _12 were reported missing and there was indeed a file _13 apparently waiting to be uploaded when network activity resumed. I only remember there being one such _13 file, but I wasn't paying particular attention at the time. Although supposedly several MB in size, it disappeared instantly from the Transfers window when the BOINC client contacted the server.

"REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH" error in both cases.

Mac OS 10.6.8. BOINC 7.0.28.


NG

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2359
Credit: 9,864,224
RAC: 4,506
Message 44583 - Posted: 26 Jul 2012, 10:07:26 UTC

Dave, the two models you mentioned were sent to you on 23 May and 17 July so they were from earlier batches of EU models. The batch generating so many REPLANCA errors was I think generated starting on 22 July. I still can't get any task pages for the regional models to open up though so I can't check what I say from the stderr files of crashed models.


____________
Cpdn news

Dave Roberts
Joined: 15 Jan 11
Posts: 107
Credit: 2,120,159
RAC: 312
Message 44586 - Posted: 26 Jul 2012, 10:14:06 UTC
Last modified: 26 Jul 2012, 10:18:05 UTC

Re my previous post on successful completions - I've just had a look at the messages and found the following, regarding successful uploads of zip 13 files after successful uploads of zips 1-12.

Wed Jul 25 22:34:37 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:39:48 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:52:59 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
Wed Jul 25 22:53:02 2012 climateprediction.net Computation for task hadam3p_eu_9xz6_1991_1_008055259_0 finished
Wed Jul 25 22:53:03 2012 climateprediction.net Starting hadam3p_eu_ctxm_2007_1_008084866_0
Wed Jul 25 22:53:03 2012 climateprediction.net Starting task hadam3p_eu_ctxm_2007_1_008084866_0 using hadam3p_eu version 609
Wed Jul 25 23:05:50 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
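
For comparison with the failed tasks, the upload times for each zip can be worked out from the Started/Finished pairs. A minimal Python sketch, assuming the timestamp layout and the " climateprediction.net " separator seen in these lines; the helper name `upload_durations` is hypothetical, and the weekday and month names are parsed with Python's default (English) locale.

```python
from datetime import datetime

# Timestamp layout as it appears in the messages above, e.g. "Wed Jul 25 22:34:37 2012".
LOG_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"
STARTED = "Started upload of "
FINISHED = "Finished upload of "


def upload_durations(lines):
    """Return seconds between 'Started upload' and 'Finished upload' per file."""
    started = {}
    durations = {}
    for line in lines:
        stamp_text, sep, rest = line.partition(" climateprediction.net ")
        if not sep:
            continue  # not a project message line
        if rest.startswith(STARTED):
            name = rest[len(STARTED):].strip()
            started[name] = datetime.strptime(stamp_text.strip(), LOG_TIME_FORMAT)
        elif rest.startswith(FINISHED):
            name = rest[len(FINISHED):].strip()
            if name in started:
                finished = datetime.strptime(stamp_text.strip(), LOG_TIME_FORMAT)
                durations[name] = (finished - started[name]).total_seconds()
    return durations


log_lines = [
    "Wed Jul 25 22:34:37 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip",
    "Wed Jul 25 22:39:48 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip",
    "Wed Jul 25 22:52:59 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip",
    "Wed Jul 25 22:53:02 2012 climateprediction.net Computation for task hadam3p_eu_9xz6_1991_1_008055259_0 finished",
    "Wed Jul 25 23:05:50 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip",
]

for name, seconds in upload_durations(log_lines).items():
    print(name, seconds)
```

On these lines, zip 12 took 311 seconds and zip 13 took 771 seconds to upload, so on a healthy task the zip 13 file is a real upload rather than something that vanishes instantly.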

mo. v - Was preparing this before I saw your post.


Copyright © 2016 climateprediction.net