Output file absent:
22/07/2012 10:38:50 | climateprediction.net | Computation for task hadam3p_eu_634j_2009_1_008071304_2 finished
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_2.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_3.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_4.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_5.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_6.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_7.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_8.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_9.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_10.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_11.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
22/07/2012 10:38:50 | climateprediction.net | Output file hadam3p_eu_634j_2009_1_008071304_2_12.zip for task hadam3p_eu_634j_2009_1_008071304_2 absent
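Task 14973021
Workunit 8226418
Computer ID 1212547
Sent 22 Jul 2012 0:47:15 UTC
Received 22 Jul 2012 10:30:11 UTC
Status Error while computing
Run time 26,180.15
CPU time 25,922.02
Claimed credit 0.00
Granted credit ---
application version UK Met Office HADAM3P European Region v6.09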
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
-161 is a File Not Found error.
My System.
My Task
The WorkUnit
Notes: The Ethernet connection to the Internet was disconnected at the time. I was also running POEM (GPU), RNA World and yoyo tasks. Only 4 CPU threads were in use (due to POEM requirements/setup). Write to disk is set to every 900 s. No other system or BOINC issues.
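For anyone who wants to pull the missing file names out of a result page rather than read them by eye, here is a minimal sketch. It assumes the `<message>` text has been copied into a string; the function name `parse_xfer_errors` and the lookup table are made up for illustration, and the only error code it knows is the -161 (File Not Found) mentioned above.

```python
import re

# Hypothetical lookup table; only -161 comes from this thread ("File Not Found").
ERROR_NAMES = {-161: "file not found"}

# Matches one <file_xfer_error> element as it appears in the pasted stderr.
XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def parse_xfer_errors(message: str) -> list[tuple[str, int]]:
    """Return (file_name, error_code) pairs found in a <message> block."""
    return [(m.group("name"), int(m.group("code"))) for m in XFER_RE.finditer(message)]


# Two entries copied from the task output above.
sample = """upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>"""

errors = parse_xfer_errors(sample)
for name, code in errors:
    print(f"{name}: {code} ({ERROR_NAMES.get(code, 'unknown')})")
```

Pasting the full message instead of the two-entry sample should list every zip that the client reported as absent.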
____________
 |
|
|
|
|
|
Yes, I've seen maybe a half-dozen of these in the last few weeks. They are malformed tasks that have been automatically re-issued but will never work because of the
"REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH"
error. Just let them die and don't worry about it.
____________
|
|
|
|
|
|
Thanks for the confirmation. This sort of issue occurs at other projects too, usually when the researchers make a mistake while building the tasks, but it has also been caused by deprecated clients in the case of auto-generated tasks.
Might it be possible and worthwhile to add an early trickle point or a file-check routine to reduce the loss in such situations, so that these tasks fail early rather than after, say, 10 h?
____________
 |
|
|
|
|
|
A quick check shows that six out of the 330 AM3P models I have run this year on three PCs (two XP, one Linux) have zonked out with an error, including this 'output file absent'. That's an attrition rate of less than 2%, which is very low compared to the much higher attrition rates on the longer models.
(I lost an AM3p and a CM3 yesterday to a very short power brownout that caused one PC and the internet router to reboot. The other PC, two laptops, monitors and a printer didn't blink.)
For an ensemble methodology, a 2% attrition rate is probably not worth the effort of delving further into the reasons for the error. I simply accept there will be an attrition rate.
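As a quick check of the figure quoted in this post, the arithmetic can be written out. This is only a sketch using the two counts given above (6 failures out of 330 models); the variable names are illustrative.

```python
# Counts taken from the post above.
failed = 6
total = 330

# Attrition rate as a fraction of all models run.
rate = failed / total
print(f"Attrition rate: {rate:.2%}")
```

The result is a little over 1.8%, consistent with the "less than 2%" stated above.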
____________
|
|
|
|
|
|
Thanks, that saves me searching for answers. I had two PNW tasks go like this for me yesterday, though there was a power cut involved as well, so I can't be 100% sure of the cause.
Any typing errors are due to my not being used to the tiny netbook keyboard. The Atom is slowly making its way through two EU units. I will have to get the extra GB of memory to see if it makes any difference. |
|
|
mo.v, Forum moderator
Joined: Sep 29 04 Posts: 2267 Credit: 5,304,959 RAC: 2,077
|
|
This REPLANCA thing is an error in the model. It happened a few months ago so we need to check whether there's a new batch of models with the same problem. It looks as if the headers on ancillary files don't match:
http://cms.ncas.ac.uk/trac/UMHelpdesk/ticket/399
It's a real nuisance that the web pages for these regional models take ages to open up so it's not easy to see what's happening with different WUs.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
Between WU 8226400 and WU 8226430 there are 15 failed tasks; several have failed more than once, and none has reported successfully.
All are UK Met Office HADAM3P European Region and all were created at around the same time (20 Jul 2012 5:50:00 to 5:59:00 UTC).
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226418
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226422
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226419
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226417
____________
 |
|
|
mo.v, Forum moderator
|
|
I can't get the task pages to open for me at all, even after hours. I can only look at the WU and computer pages, so I can't see whether all the computers are crashing the models with the same error. (I'm discounting computers that can't run any climate models at all and need to have their daily quota reduced until their owners put things right.)
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8226420
Now why can this computer with Windows complete one of this batch of models?
____________
|
|
mo.v, Forum moderator
|
|
I've found some Windows machines with the error and two that have now completed their model. There's a single Mac that seems to be crunching one OK. All the other Macs I've found are crashing everything with the usual problem.
____________
|
|
mo.v, Forum moderator
|
|
I wonder whether something else unrelated (?) to the REPLANCA error is going on with the EU models. Look at Paolo's computer and its tasks.
It can process Hadcm, Hadam PNW and Hadam SA nicely. But it crashes every Hadam EU in less than a minute as if the computer was misconfigured. These can't all be REPLANCA crashes.
____________
|
|
|
|
|
Ah yes, Replanca. I gambled away a small fortune at its beach-side casinos; where I wined and dined an Italian woman whose name I cannot remember....
Where was I? Oh yes, task 14903295 a PNW, just turned this error up at around 98% completion. I have another PNW finishing up shortly, we'll see what happens. |
|
|
|
|
|
No, there is no Replanca ..., nor Italian women whose names I cannot remember for that matter. Just sounded like an exotic place name, like Pollenca or Menorca ;)
Edit: my other PNW finished fine. |
|
|
|
|
|
Paolo's Hadam EU tasks on that computer are all crashing with an exit status of -2:
Outcome Client error
Client state Compute error
Exit status -2 (0xfffffffffffffffe)
I think this is an issue with the task or app and has nothing to do with Windows, the BOINC manager or client, or other apps.
Some of Paolo's other computers are failing due to the REPLANCA issue with Exit status 0, error_code -161 (file_xfer_error):
Exit status 0 (0x0)
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
Some of these don't seem to run (Error while downloading) but others do run (file_xfer_error):
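Task 14947580
Workunit 8213025
Sent 19 Jul 2012 17:57:28 UTC
Received 21 Jul 2012 3:16:32 UTC
Status Error while computing
Run time 102,580.61
CPU time 100,825.80
Claimed credit 399.11
Granted credit 399.11
application version UK Met Office HADAM3P European Region v6.09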
In this case, could the trickle fail (file_xfer_error) and in turn cause the task to be killed, and could all this be linked to the server's availability/responsiveness (pages not loading)?
- More likely one of the ranges is out!
____________
 |
|
|
|
|
|
I have the Replanca problem, too. Four models in a row have a computation error after about 13000 s of computation time.
____________
|
|
|
|
|
|
Lots of people seem to be getting this. I'm up to my 4th or 5th failure. :(
Information that would be useful:
The actual name of the failed model.
Roughly when it failed.
Whether you noticed that a mysterious "zip 13" file had been created.
e.g. For one of mine:
hadam3p_eu_8aow_2005_1_008058020_0
REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
This was between zips 2 and 3, at 25 hours 47 minutes 39 seconds, and a zip 13 was created.
____________
Backups: Here |
|
|
|
|
|
Here are my failed WUs. All failed with REPLANCA in stderr.out:
hadam3p_eu_cqxv_2000_1_008083091_2 zip1, zip13 uploaded
hadam3p_eu_ctbo_2009_1_008084522_1 zip1, zip13 uploaded
hadam3p_eu_ctx1_2008_1_008084858_0 zip1, zip13 uploaded
hadam3p_eu_a74l_1990_1_008067608_1 crashed after 8.79 s, no zips uploaded
hadam3p_eu_ct79_2004_1_008084440_0 zip1, zip13 uploaded
hadam3p_eu_csgf_2006_1_008083996_0 zip1, zip13 uploaded
hadam3p_eu_crlu_2005_1_008083482_0 zip1, zip13 uploaded
hadam3p_eu_cr5j_2001_1_008083225_0 zip1, zip13 uploaded
I hope that will help.
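In case it helps with sorting reports by model year, here is a small sketch that tallies the names listed above. It assumes the fourth underscore-separated field of a task name is the model year, which is an inference from the names in this thread and not a documented naming rule; `model_year` is a made-up helper.

```python
from collections import Counter

# Task names copied from the list above.
failed_tasks = [
    "hadam3p_eu_cqxv_2000_1_008083091_2",
    "hadam3p_eu_ctbo_2009_1_008084522_1",
    "hadam3p_eu_ctx1_2008_1_008084858_0",
    "hadam3p_eu_a74l_1990_1_008067608_1",
    "hadam3p_eu_ct79_2004_1_008084440_0",
    "hadam3p_eu_csgf_2006_1_008083996_0",
    "hadam3p_eu_crlu_2005_1_008083482_0",
    "hadam3p_eu_cr5j_2001_1_008083225_0",
]


def model_year(task_name: str) -> int:
    """Return the assumed year field, e.g. 2000 for hadam3p_eu_cqxv_2000_..."""
    # Assumption: the layout is hadam3p_eu_<id>_<year>_<n>_<digits>_<attempt>.
    return int(task_name.split("_")[3])


failures_by_year = Counter(model_year(name) for name in failed_tasks)
for year in sorted(failures_by_year):
    print(year, failures_by_year[year])
```

With these eight names every year appears once, which fits the impression that the failures are spread over many years.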
____________
|
|
|
|
|
|
I don't know if this info is useful for comparison/investigative purposes, but just in case...
One of my computers (ID: 1142892) has been running tasks of this model successfully for a while, the latest (Task ID 8210373) successfully completing yesterday. The previous run was Task ID 14734712, which completed successfully on 31st May. |
|
|
|
|
|
hadam3p_eu_cqgw_2005_1_008082804_0 and hadam3p_eu_cqgu_2003_1_008082803_0. Downloaded at 09:14 BST yesterday and run in parallel from then until they apparently "completed" within seconds of each other at getting on for 01:00 this morning. Files _2 to _12 were reported missing and there was indeed a file _13 apparently waiting to be uploaded when network activity resumed. I only remember there being one such _13 file, but I wasn't paying particular attention at the time. Although supposedly several MB in size, it disappeared instantly from the Transfers window when the BOINC client contacted the server.
"REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH" error in both cases.
Mac OS 10.6.8. BOINC 7.0.28.
NG |
|
|
mo.v, Forum moderator
|
|
Dave, the two models you mentioned were sent to you on 23 May and 17 July, so they were from earlier batches of EU models. The batch generating so many REPLANCA errors was, I think, generated starting on 22 July. I still can't get any task pages for the regional models to open, though, so I can't check what I say against the stderr files of crashed models.
____________
|
|
|
|
|
Re my previous post on successful completions - I've just had a look at the messages and found the following, regarding successful uploads of zip 13 files after successful uploads of zips 1-12.
Wed Jul 25 22:34:37 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:39:48 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_12.zip
Wed Jul 25 22:52:59 2012 climateprediction.net Started upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
Wed Jul 25 22:53:02 2012 climateprediction.net Computation for task hadam3p_eu_9xz6_1991_1_008055259_0 finished
Wed Jul 25 22:53:03 2012 climateprediction.net Starting hadam3p_eu_ctxm_2007_1_008084866_0
Wed Jul 25 22:53:03 2012 climateprediction.net Starting task hadam3p_eu_ctxm_2007_1_008084866_0 using hadam3p_eu version 609
Wed Jul 25 23:05:50 2012 climateprediction.net Finished upload of hadam3p_eu_9xz6_1991_1_008055259_0_13.zip
mo. v - Was preparing this before I saw your post. |
|
|
|
|
|
The reason for asking for the file names of faulty models, is that the project people want to know which years have the error.
And it seems like they're spread over a lot of years.
____________
|
|
|
|
The reason for asking for the file names of faulty models, is that the project people want to know which years have the error.
And it seems like they're spread over a lot of years.
In that case, I've got one here: hadam3p_eu_8a9u_2003_1_008057882_1. Note that this one was sent to me on the 18th of July. |
|
|
|
|
Files _2 to _12 were reported missing and there was indeed a file _13 apparently waiting to be uploaded when network activity resumed. I only remember there being one such _13 file, but I wasn't paying particular attention at the time. Although supposedly several MB in size, it disappeared instantly from the Transfers window when the BOINC client contacted the server.
That happens because an error automatically makes the task reportable by the BOINC client. Once the scheduler request that reports it has been acknowledged, the client deletes all references to the task (including any pending or in-progress uploads).
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
|
|
|
|
|
This may be related. hadam3p_eu tasks are certainly exiting early (some almost immediately after the task's first upload), and because they exit early (which I think is a symptom), the remaining result zip files are missing:
http://climateprediction.net/board/viewtopic.php?f=4&t=10619 |
|
|
|
|
|
Some details from different systems:
Task 14973021
Name hadam3p_eu_634j_2009_1_008071304_2
Workunit 8226418
Created 22 Jul 2012 0:43:29 UTC
Sent 22 Jul 2012 0:47:15 UTC
Received 22 Jul 2012 10:30:11 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
Computer ID 1212547
Report deadline 4 Jul 2013 6:07:15 UTC
Run time 26,180.15
CPU time 25,922.02
Validate state Invalid
Claimed credit 200.38
Granted credit 200.38
application version UK Met Office HADAM3P European Region v6.09
Stderr
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_2_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
Name hadam3p_eu_2j5d_1987_1_008071308_1
Workunit 8226422
Created 20 Jul 2012 7:01:50 UTC
Sent 20 Jul 2012 7:52:01 UTC
Received 21 Jul 2012 8:19:10 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
Computer ID 1126062
Report deadline 2 Jul 2013 13:12:01 UTC
Run time 13,805.54
CPU time 13,678.24
Validate state Invalid
Claimed credit 0.00
Granted credit 0.00
application version UK Met Office HADAM3P European Region v6.09
Stderr
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Signal 15 received, exiting...
Called boinc_finish
Signal 15 received, exiting...
Called boinc_finish
Signal 15 received, exiting...
Called boinc_finish
SIGSEGV: segmentation violation
Stack trace (14 frames):
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x836e1cf]
[0xf0f87400]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x8136129]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x813c074]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x8131c87]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x813d6aa]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x8133fca]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x8078e6f]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x82d73ae]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x82f8867]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x82f14bb]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x82f97f6]
/lib32/libc.so.6(__libc_start_main+0xe5)[0xf0df342d]
/home/aida/BOINC/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-pc-linux-gnu[0x804caf1]
Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=3708, selfPID=3695, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_2j5d_1987_1_008071308_1_13.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
Name hadam3p_eu_60t3_2009_1_008071305_0
Workunit 8226419
Created 20 Jul 2012 5:56:54 UTC
Sent 20 Jul 2012 6:02:06 UTC
Received 22 Jul 2012 3:45:28 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
Computer ID 1192477
Report deadline 2 Jul 2013 11:22:06 UTC
Run time 74,050.46
CPU time 72,651.55
Validate state Invalid
Claimed credit 200.38
Granted credit 200.38
application version UK Met Office HADAM3P European Region v6.09
Stderr
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_60t3_2009_1_008071305_0_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
Name hadam3p_eu_634j_2009_1_008071304_1
Workunit 8226418
Created 21 Jul 2012 5:03:17 UTC
Sent 21 Jul 2012 5:11:11 UTC
Received 22 Jul 2012 0:43:28 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
Computer ID 1221572
Report deadline 3 Jul 2013 10:31:11 UTC
Run time 54,671.36
CPU time 54,503.55
Validate state Invalid
Claimed credit 200.38
Granted credit 200.38
application version UK Met Office HADAM3P European Region v6.09
Stderr
<core_client_version>7.0.25</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_634j_2009_1_008071304_1_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
Name hadam3p_eu_6c44_2009_1_008071303_0
Workunit 8226417
Created 20 Jul 2012 5:56:29 UTC
Sent 20 Jul 2012 6:01:45 UTC
Received 21 Jul 2012 1:04:09 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
Computer ID 915051
Report deadline 2 Jul 2013 11:21:45 UTC
Run time 47,264.36
CPU time 46,751.77
Validate state Invalid
Claimed credit 200.38
Granted credit 200.38
application version UK Met Office HADAM3P European Region v6.09
Stderr
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=4524, selfPID=4524, iMonCtr=2
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_6c44_2009_1_008071303_0_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
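To put a rough number on the computation lost, the run times of the five distinct tasks listed above can be added up. This is a sketch only: the figures are copied as displayed, the helper `to_seconds` is made up, and note that the task on workunit 8226422 ended with a segmentation violation and not a REPLANCA message.

```python
# Run times in seconds, copied as displayed in the task details above.
run_times = {
    "hadam3p_eu_634j_2009_1_008071304_2": "26,180.15",
    "hadam3p_eu_2j5d_1987_1_008071308_1": "13,805.54",
    "hadam3p_eu_60t3_2009_1_008071305_0": "74,050.46",
    "hadam3p_eu_634j_2009_1_008071304_1": "54,671.36",
    "hadam3p_eu_6c44_2009_1_008071303_0": "47,264.36",
}


def to_seconds(text: str) -> float:
    """Convert a displayed value such as '26,180.15' to a float."""
    return float(text.replace(",", ""))


total_seconds = sum(to_seconds(value) for value in run_times.values())
total_hours = total_seconds / 3600
print(f"{len(run_times)} tasks, about {total_hours:.1f} hours of run time")
```

That comes to roughly 60 hours across these five results.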
____________
 |
|
|
mo.v, Forum moderator
|
|
Thanks for the details, skgiven. I was mistaken in thinking that the REPLANCA batches started on 22 July. There were batches created on 21 and 20 July too.
____________
|
|
|
|
|
Possibly a few more were created more recently:
hadam3p_eu_cryy_2004_1_008083704_1 Sent 25 Jul 2012 3:03:18 UTC
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
hadam3p_eu_cu52_2000_1_008084996_0 Sent 24 Jul 2012 14:17:28 UTC
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
hadam3p_eu_cssi_2001_1_008084199_0 Sent 24 Jul 2012 20:21:54 UTC
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
hadam3p_eu_cqol_2007_1_008082936_0 Sent 25 Jul 2012 7:12:34 UTC
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
hadam3p_eu_colq_2007_1_008081725_0 Sent 25 Jul 2012 17:37:05 UTC
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
But this is a small percentage of the WUs from the last few days.
Most of what my machines downloaded in the last 3 days has had no problems at all.
____________
|
|
|
|
|
|
Should we report all instances of REPLANCA failures? I've just had my 1st.
Messages :-
Fri Jul 27 06:02:13 2012 Started upload of hadam3p_eu_cq3s_2006_1_008082615_2_1.zip
Fri Jul 27 06:07:08 2012 Finished upload of hadam3p_eu_cq3s_2006_1_008082615_2_1.zip
Fri Jul 27 07:58:26 2012 Started upload of hadam3p_eu_cq3s_2006_1_008082615_2_13.zip
Fri Jul 27 07:58:29 2012 Computation for task hadam3p_eu_cq3s_2006_1_008082615_2 finished
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_2.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_3.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_4.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_5.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_6.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_7.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_8.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_9.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_10.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_11.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Fri Jul 27 07:58:29 2012 Output file hadam3p_eu_cq3s_2006_1_008082615_2_12.zip for task hadam3p_eu_cq3s_2006_1_008082615_2 absent
Stderror :-
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
|
|
|
|
|
|
I think we've worked out that it's EU models that have the fault.
Set your prefs for only PNW, and you should be OK.
____________
Backups: Here |
|
|
|
|
|
I have failing pnw, too:
hadam3p_pnw_bdmc_1973_1_008097714_0
hadam3p_pnw_b9zc_1977_1_008097176_0
They failed after 10 s of runtime!
stderr shows:
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<stderr_txt>
GCM: BUFFIN : Read Failed: No such file or directory
GCM : BUFFIN: C I/O Error feof - Unit 30 - Return code = 16
GCM : BUFFIN: C I/O Error feof - Unit 30 - Return code = 16
Model crashed: REPLANCA :I/O ERROR tmp/xaakm.pipe_dummy 2048
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=15304, selfPID=15304, iMonCtr=2
Leaving CPDN_Main::Monitor...
Regional yearly means requires 12 input files got 0
Called boinc_finish
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_pnw_bdmc_1973_1_008097714_0_13.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
____________
|
|
|
|
|
|
It's a waste of time and space posting long strings of "error 161" messages.
These aren't about model failures. They just mean that BOINC can't find these files when it tries to upload them. Which is obvious, as they were never created in the first place. The model crashed before getting that far.
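For anyone who wants to trim a report before posting it, here is a minimal sketch in Python. It assumes the report is plain text containing `<file_xfer_error>` blocks laid out as in the posts above; the function name `collapse_xfer_errors` is made up for this example and is not part of BOINC.

```python
import re

# Matches one <file_xfer_error>...</file_xfer_error> block and captures its error code.
XFER_BLOCK = re.compile(
    r"<file_xfer_error>.*?<error_code>(-?\d+)</error_code>.*?</file_xfer_error>\s*",
    re.DOTALL,
)

def collapse_xfer_errors(report: str) -> str:
    """Replace all file_xfer_error blocks with one summary line at the end."""
    codes = XFER_BLOCK.findall(report)
    if not codes:
        return report  # nothing to trim
    stripped = XFER_BLOCK.sub("", report)
    summary = (
        f"[{len(codes)} file_xfer_error entries omitted, "
        f"codes: {sorted(set(codes))}]"
    )
    return stripped.rstrip() + "\n" + summary + "\n"
```

The lines that matter, such as the "Model crashed:" message, pass through untouched; only the repeated upload errors are folded into a single count.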
____________
Backups: Here |
|
|
|
|
|
Not that I necessarily expect an answer, but I'd be curious to know why the European models are failing? |
|
|
|
|
|
Only a small fraction are failing.
Because the download files are not exactly right.
And the problem will be or has been fixed already.
So when the problem work units clear the queue this problem will be gone.
And then, because this whole project is cutting edge and really complex, there will probably be a few more malformed work units later.
____________
|
|
|
|
|
|
"REPLANCA" is an error that means a program is expecting X number of values, but only found X-n.
It happens when a limited number of values is used to test a program, and then everything is increased to the full range of values, except for one of the ancillary files where the list of values doesn't get increased.
So someone in one of the research groups has supplied the Oxford people with a faulty file.
The question then becomes: which file? from which research group? and for what range(s) of model dates?
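As a toy illustration of the kind of mismatch described above (this is not the model's actual REPLANCA routine; the function and its arguments are invented for the example), a check of "expected X values, found X-n" might look like this in Python:

```python
def check_field_count(expected: int, values: list) -> None:
    """Raise if a file supplies a different number of values than the run expects."""
    found = len(values)
    if found != expected:
        raise ValueError(
            f"headers do not match: expected {expected} values, found {found}"
        )

# A short test file passes a short test run...
check_field_count(3, ["jan", "feb", "mar"])

# ...but the same file fails once the run is extended to the full range.
try:
    check_field_count(12, ["jan", "feb", "mar"])
except ValueError as err:
    print(err)
```

The point is only that the file and the run each look fine on their own; the failure shows up when the two disagree.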
***************
I also had one SAF model fail with this error, and Nowi is reporting PNWs failing with it.
____________
Backups: Here |
|
|
|
|
|
Yes I got a couple. Mine are all PNW models
REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Regional yearly means requires 12 input files got 1
Link to work unit here
Les, do you want to know about these or do we just ignore them? I see there are 14,000+ PNW work units on the queue so there are bound to be more in there.
____________
BOINC blog |
|
|
|
|
|
Hi Mark
I'm not sure, but I guess we should know about the PNW baddies as well.
It's going to be another 24-30 hours before anyone shows up, but I'll pass on the news.
____________
Backups: Here |
|
|
|
|
|
Yep. I've had a PNW error overnight too. Same symptoms. A few more points awarded though. :)
hadam3p_pnw_bdp4_1993_1_008097733_0
NG |
|
|
|
|
Hi Mark
I'm not sure, but I guess we should know about the PNW baddies as well.
It's going to be another 24-30 hours before anyone shows up, but I'll pass on the news.
Replanca errors:
resultid=14901620
resultid=15011909
resultid=14819189
resultid=15021473
Some others complaining about files (no mention of Replanca though). These crash in about 600 seconds elapsed
Model crashed:
Leaving CPDN_Main::Monitor...
Regional yearly means requires 12 input files got 0
Called boinc_finish
resultid=14819102
resultid=14819127
And another which might just be some weird parameters:
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=964, selfPID=964, iMonCtr=2
Leaving CPDN_Main::Monitor...
Regional yearly means requires 12 input files got 0
resultid=14906965
____________
BOINC blog |
|
|
|
|
|
Just in case you are still collecting details of tasks with the REPLANCA error, hadam3p_eu_ale0_2000_1_008070909_2 is one: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=14975475 I am suspicious, though, as this happened after the computer had just been restarted, or at least that was when I noticed it and the zip13 uploaded.
Dave |
|
|
|
|
|
Some more Replanca errors...
resultid=15022759
resultid=15024598
resultid=15033209
resultid=15028563
resultid=15032537
resultid=15035539
resultid=15039466
resultid=15034026
resultid=15034029
resultid=15034537
resultid=15034564
resultid=15034565
Looks to me like they are all stuffed. Perhaps the project would be better served by cancelling the remaining ones on the queue that haven't been sent out and resubmitting them after fixing the replanca issue.
What's really annoying is that they run for 18-19 hours before they commit suicide, and then to top it off they create the usual 32 MB _13 file to upload. It's probably useless anyway, seeing as the model only has 1 of the 12 input files.
____________
BOINC blog |
|
|
|
|
|
My latest one to crash with the REPLANCA error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File - Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing?
Dave |
|
|
|
|
|
I see the (presumably offending) tasks have gone from the server.
Dave |
|
|
|
|
My latest one to crash with the REPLANCA error did so after about 40 hours, which on my machine is 4 or 5 zip files' worth. This was after a restart, but the model had been suspended and File - Exit used to shut BOINC down before hibernating the computer. Has anyone else had them go this far before crashing?
Dave
They usually die straight after the first trickle/zip for me
____________
BOINC blog |
|
|
|
|
|
The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over.
Dave |
|
|
|
|
|
All the recent ones I have had have failed, for a few days now.
Would be nice to have one not fail around the _2.zip point.
____________
|
|
|
|
|
The rate at which the number of tasks in progress is going down on the server page indicates there are still a lot of units falling over.
Dave
Once they've been sent out there probably isn't a lot the project can do. While it is possible for a project to abort in-progress tasks, the version of BOINC that CPDN runs server-side may not support it. GPUgrid used to do it, but then people complained about how their task got aborted after many hours of crunching. The tasks will fail anyway, so it's probably better just to let them die on their own.
____________
BOINC blog |
|
|
|
|
|
Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the "replanca" variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now.
Is there a problem with the supply of new tasks - possibly as a result of this issue?
____________
Brian |
|
|
|
|
[nedsram-cdl wrote:] Every task I have had on my laptop for the last week or so has also failed. The ones I have checked seem to be of the "replanca" variety. However I am unable to obtain any new tasks, so it has been effectively idle for several days now.
Is there a problem with the supply of new tasks - possibly as a result of this issue?
The work units in the queue affected by the REPLANCA problem have been withdrawn and results that are running are failing quickly, so the supply of new units has declined to zero and the total number of running results has reduced somewhat as well. No doubt someone is working on a new set of work units with a correct set of ancillary files and the queue will fill accordingly when that is done. We'll know it's fixed when that happens! |
|
|
|
|
|
I just lost a hadam3p_eu WU after the first zip file, probably due to the REPLANCA error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch.
Should I abort them before they start or let them run till they crash? Are they from the same bad batch? How do I tell?
____________
|
|
|
|
|
I just lost a hadam3p_eu WU after the first zip file, probably due to the REPLANCA error. There are 2 hadam3p_eu WUs (hadam3_eu_ctvq_2005_1_008084837_0 and hadam3p_eu_cum6_2000_1_008085302_1) sitting on my machine, most likely from the same bad batch.
Should I abort them before they start or let them run till they crash? Are they from the same bad batch? How do I tell?
It looks like the 2 you mention were downloaded July 24th. Thus, they are likely bad. One of the work units that the tasks belong to has already had a task crash with a REPLANCA error. I'd abort them. |
|
|
|
|
|
hello everyone,
sorry but I have not had time to read this whole thread.
I'm crunching the following 4 wu and they seem to be returning zip files ok.
and I was wondering if it is ok to let them continue to run ?
hadam3p_pnw_c6nd_1993_1_008091178 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadam3p_pnw_c75k_1968_1_008091170 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadcm3n_o44o_2100_40_008085978 - - - - - Sent - - 25 Jul 2012 20:48:43 UTC
hadam3p_eu_alis_1998_1_008068421 - - - - Sent - - 19 Jul 2012 18:02:52 UTC
my computer id 948812
my account userid=910
thanks ,
Byron |
|
|
|
|
|
There are 3 separate problems, all from around the time that your models were sent.
In order of when they happened to mine:
Some will fail at around 9-10 hours, between zips 1 & 2
Some will fail at around 19-20 hours
Some will have files that "can't be found", and cause download failures
And there were also models that ran OK.
The first 2 were due to REPLANCA errors: an ancillary file not having the correct number of data values. The 3rd was an error with the path of a mirror server.
All models were deleted from the download pool, but there are still re-sends, caused by people not starting work that they received back then.
If you're running any of the failures you'll soon find out.
____________
Backups: Here |
|
|
|
|
and I was wondering if it is ok to let them continue to run ?
hadam3p_pnw_c6nd_1993_1_008091178 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadam3p_pnw_c75k_1968_1_008091170 - - Sent - - 26 Jul 2012 14:03:18 UTC
hadcm3n_o44o_2100_40_008085978 - - - - - Sent - - 25 Jul 2012 20:48:43 UTC
hadam3p_eu_alis_1998_1_008068421 - - - - Sent - - 19 Jul 2012 18:02:52 UTC
my computer id 948812
my account userid=910
Looks like all 4 of them should continue on okay. None look to be in the bad batches. You've already made enough progress on them that they've gotten past the typical failure points for EU and PNW models. |
|
|
|
|
|
Thank you geophi and Les Bayliss for your reply
Yes all 4 seem to be continuing ok with no problems.
So I will let them continue to run to the end.
thanks,
Byron |
|
|
|
|
|
I just recently got a result error with the following stderr:
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Called boinc_finish
</stderr_txt>
<message>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_2.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_3.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_4.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_5.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_6.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_7.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_8.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_9.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_10.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_11.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_12.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_eu_69wa_2000_1_008138105_0_13.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
]]>
And also the following messages in the client:
8/12/2012 9:26:31 PM climateprediction.net Computation for task hadam3p_eu_69wa_2000_1_008138105_0 finished
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_1.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_2.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_3.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_4.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_5.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_6.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_7.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_8.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_9.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_10.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_11.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_12.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
8/12/2012 9:26:31 PM climateprediction.net Output file hadam3p_eu_69wa_2000_1_008138105_0_13.zip for task hadam3p_eu_69wa_2000_1_008138105_0 absent
Is it another kind of problem with the WUs? |
|
|
|
|
|
The files are missing because the model crashed soon after starting, so none of the output data files got created. It's BOINC complaining about not being able to find them.
Only the first couple of lines of the STDERR file are relevant.
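If it helps, pulling out just the relevant part can be sketched in a few lines of Python. This assumes the crash reason always sits on a line starting with "Model crashed:", as it does in the reports in this thread; `crash_reasons` is a name made up for the example.

```python
def crash_reasons(stderr_text: str) -> list:
    """Return the distinct 'Model crashed:' messages, in order of first appearance."""
    prefix = "Model crashed:"
    seen = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            reason = line[len(prefix):].strip()
            # Repeated crashes usually print the same message; keep one copy.
            if reason and reason not in seen:
                seen.append(reason)
    return seen
```

Run on the report above, this would keep the INITTIME line and drop everything about absent zip files.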
____________
Backups: Here |
|
|
|
|
|
Another one that crashed...
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/xaakm.pipe_dummy 2048
Leaving CPDN_Main::Monitor...
Regional yearly means requires 12 input files got 0
Wu name: hadam3p_pnw_2yuc_1975_1_008145549_1
Created: 15 Aug 2012
I would link to it but your Akismet anti-spam system thinks your own URLs are spam. The wuid is 8300673
____________
BOINC blog |
|
|
|
|
|
More of these REPLANCA errors in this UK Met Office Coupled Model Full Resolution Ocean WU created Friday http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8395212
<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>
]]>
____________

Professor Desty Nova
Researching Karma the Hard Way |
|
|
mo.v
Forum moderator
Joined: Sep 29 04 Posts: 2267 Credit: 5,304,959 RAC: 2,077
|
|
Thanks, Professor. The REPLANCA errors have been reported to Andy and Jonathan. If one task in a WU crashes with REPLANCA, all the tasks in that WU will, and on all OSs.
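A rough way to apply this to your own task list, sketched in Python. It assumes a task name is the work unit name plus a trailing `_N` index, which matches the names quoted in this thread; both function names are invented for the example.

```python
def workunit_of(task_name: str) -> str:
    """Drop the trailing _N index from a task name to get its work unit name."""
    wu, _, index = task_name.rpartition("_")
    if not wu or not index.isdigit():
        raise ValueError(f"unexpected task name: {task_name!r}")
    return wu

def siblings_of_failed(failed_tasks, queued_tasks):
    """Return queued tasks that share a work unit with any failed task."""
    failed_wus = {workunit_of(t) for t in failed_tasks}
    return [t for t in queued_tasks if workunit_of(t) in failed_wus]

failed = ["hadam3p_eu_cq3s_2006_1_008082615_2"]
queued = [
    "hadam3p_eu_cq3s_2006_1_008082615_3",    # hypothetical resend of the same WU
    "hadam3p_pnw_c6nd_1993_1_008091178_0",   # hypothetical task from another WU
]
print(siblings_of_failed(failed, queued))
```

Any task this returns belongs to a work unit that has already produced a crash, so by the rule above it can be expected to fail the same way.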
____________
Cpdn news
5 CPDN READMEs |
|
|