Fatal error in last minute of WU, but still reports success. Admin, please examine!

Questions and Answers : Unix/Linux : Fatal error in last minute of WU, but still reports success. Admin, please examine!
Andy Lee Robinson
Joined: 11 Dec 05
Posts: 6
Credit: 1,468,014
RAC: 0
Message 29079 - Posted: 31 May 2007, 11:53:31 UTC

I've just finished a work unit that reported success, but the details in the result file suggest otherwise, and I didn't see anything uploaded.

Three months of processing, and all this in the last minute...
I'd like to know if it really is OK, and whether the files can be salvaged and uploaded somehow.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6312426

<core_client_version>5.5.0</core_client_version>
<stderr_txt>
(null): cannot open input file dataout/atmos_restart.day
(null): cannot open input file dataout/ocean_restart.day
... [deleted] ...
pp2netcdf crashed: Error in getting file type
Error in converting file dataout/b6hcfo.pjk6c10 to netcdf format.

pp2netcdf crashed: Error in getting file type
Error in converting file dataout/b6hcfo.pik6c10 to netcdf format.

pp2netcdf crashed: Error in getting file type
Error in converting file dataout/b6hcfo.pfk6c10 to netcdf format.

pp2netcdf crashed: Error in getting file type
Error in converting file dataout/b6hcfa.phk6c10 to netcdf format.

pp2netcdf crashed: Error in getting file type
Error in converting file dataout/b6hcfa.pgk6c10 to netcdf format.

pp2netcdf crashed: Error in getting file type
Error in converting file dataout/b6hcfa.pek6c10 to netcdf format.

pp2netcdf crashed: Error in getting file type
Error in converting file dataout/b6hcfa.pdk6c10 to netcdf format.

(null): cannot open input file dataout/ocean_restart.day

Model crashed: umshell1.f: READ_FLH: I/O error
(null): cannot open input file dataout/ocean_restart.day

Model crashed: umshell1.f: READ_FLH: I/O error
(null): cannot open input file dataout/ocean_restart.day

Model crashed: umshell1.f: READ_FLH: I/O error
(null): cannot open input file dataout/ocean_restart.day

Model crashed: umshell1.f: READ_FLH: I/O error
Fatal crash! :-(

</stderr_txt>

Strathpeffer
Joined: 9 Jan 07
Posts: 497
Credit: 342,899
RAC: 0
Message 29083 - Posted: 31 May 2007, 16:42:22 UTC
Last modified: 31 May 2007, 16:46:38 UTC

Yes Andy, it did finish, well done - result and graph here. Version 5.15 of the climate software shocks everyone by reporting every single error message since the beginning of the model, when the model completes! Looks as if you restored it from a backup at some point? (If so, well done for that too!)

Now that I look at it again, you haven't actually been granted the usual amount of credit for it, so maybe there's a missing trickle or something, but your graph is certainly showing a complete run. ;-)
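
For anyone who wants to check a result file like the one in the first post, here is a minimal sketch that counts how often each message appears in a stderr dump. The function name `summarise_stderr` is made up for illustration, and the sample text is a few lines copied from the log above. A high repeat count only shows that the same message was logged several times; it does not by itself prove the run was fine.

```python
from collections import Counter


def summarise_stderr(text):
    """Count how often each non-empty line appears in a stderr dump.

    Hypothetical helper for illustration; not part of BOINC or CPDN.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return Counter(lines)


# A few lines copied from the stderr output quoted in the first post.
sample = """\
(null): cannot open input file dataout/ocean_restart.day

Model crashed: umshell1.f: READ_FLH: I/O error
(null): cannot open input file dataout/ocean_restart.day

Model crashed: umshell1.f: READ_FLH: I/O error
Fatal crash! :-(
"""

# Print the messages, most frequent first.
for line, count in summarise_stderr(sample).most_common():
    print(f"{count:3d}  {line}")
```

Grouping the messages this way makes it easier to see that the log is a replay of a handful of recurring errors rather than a long list of different failures.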
[B^S] mavau
Joined: 30 Aug 04
Posts: 142
Credit: 9,936,132
RAC: 0
Message 29086 - Posted: 31 May 2007, 18:36:28 UTC

My latest completed model (on Vista) has the same type of error messages (a bit of a shock for me too, at first). When it finished, the last trickle was not credited immediately. Things got sorted out on the next database update.
Since the database was down from yesterday afternoon until this morning and trickles haven't been updated since then, I can see why you could be missing more.
There used to be missing-trickle issues some time ago. If I remember correctly, missing trickles are taken into account as soon as the next one comes in and/or the database is updated.

Andy Lee Robinson
Joined: 11 Dec 05
Posts: 6
Credit: 1,468,014
RAC: 0
Message 29097 - Posted: 1 Jun 2007, 10:10:49 UTC - in response to Message 29083.  

Yes Andy, it did finish, well done - result and graph here. Version 5.15 of the climate software shocks everyone by reporting every single error message since the beginning of the model, when the model completes! Looks as if you restored it from a backup at some point? (If so, well done for that too!)

Now that I look at it again, you haven't actually been granted the usual amount of credit for it, so maybe there's a missing trickle or something, but your graph is certainly showing a complete run. ;-)


Thanks very much for your reassurance - I have another one on the other core due to finish in 5 hours' time, so I'm looking forward to that too!
I'm surprised that it didn't seem to upload everything on completion.

Yes, it is quite an achievement to actually complete a WU. I tried a few times on my overclocked Core2, but after a few weeks a crash would happen, something would get corrupted and the WU would abort :-(
This time I ran it on my Linux production webserver, which is quite lightly loaded and stable (as it has to be!), and the WUs survived. I tried to just leave it alone as much as possible, and not even sneeze in the general vicinity!
It might be a good idea to award a substantial credit prize on successful completion.

I hadn't restored a backup on the machine, but I upgraded the kernel a few times, which required reboots.

Once the last 5.15 WU has completed, should I detach and reattach to clean out the folder and prepare for the new app?

Cheers,
Andy.
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 29124 - Posted: 2 Jun 2007, 22:54:28 UTC - in response to Message 29097.  


It might be a good idea to award a substantial credit prize on successful completion.

Nice idea but it can't be done. CPDN awards credit as the Run progresses. It is intended that all BOINC Projects award the same amount of credit for equal amounts of work. So...

If a CPDN Run bombs somewhere along the way, the participant still gets par-value credit for work done.

If a Run finishes, full credit will have been given. If CPDN then tossed in a bonus, the theoretical balance among Projects would be skewed. It might draw additional participants to CPDN but I doubt it would please leaders of other Projects.

Congratulations on your success. Your effort contributed significantly to the science. Thanks for participating and I hope we see you around for more. (Note: New options are being tested, shorter-running than the current Coupled Model. Other Models are in various stages of planning and development, so it will be an interesting place to be for quite a while.)

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Andy Lee Robinson
Joined: 11 Dec 05
Posts: 6
Credit: 1,468,014
RAC: 0
Message 29125 - Posted: 3 Jun 2007, 0:10:17 UTC - in response to Message 29124.  

Nice idea but it can't be done. CPDN awards credit as the Run progresses. It is intended that all BOINC Projects award the same amount of credit for equal amounts of work. So...

Well, in principle yes, but in practice I suspect a little more generosity wouldn't go amiss, as these WUs are about 1000x longer than any others and require a lot of patience, commitment and stamina to see through!


If a CPDN Run bombs somewhere along the way, the participant still gets par-value credit for work done.

Yes, but this is also a negative thing, which doesn't give so much incentive to take care of the task!


If a Run finishes, full credit will have been given. If CPDN then tossed in a bonus, the theoretical balance among Projects would be skewed. It might draw additional participants to CPDN but I doubt it would please leaders of other Projects.

Well, perhaps avoiding churn and keeping the existing participants interested may be more significant than bringing in new ones that then just drop out after a while!


Congratulations on your success. Your effort contributed significantly to the science. Thanks for participating and I hope we see you around for more. (Note: New options are being tested, shorter-running than the current Coupled Model. Other Models are in various stages of planning and development, so it will be an interesting place to be for quite a while.)


Thanks - I have a nice warm fuzzy feeling now at actually having got all the way through two of these monsters... :-) The sulphur runs last year were much shorter, but I still had difficulty keeping a machine stable enough to run continuously while surviving occasional power outages, developing applications that could do all sorts of unpredictable things, rendering animations, etc.

I think a greater degree of granularity would help overall, say distributing 10-year pieces - you could combine them as they come in, though there isn't the same magnitude of satisfaction on completion! ;-)
Also, optimisation for the significant number of SSEn+ enabled processors would help (of course without losing sight of accuracy), and maybe even a PS3 version, which I think would be a major feat! Conceivably PS3s could do a WU in about a week if single precision could be fudged to produce acceptable results, though a PS3 version would still be useful in double-precision mode. I guess the next version of the Cell will do DP just as quickly as current SP anyway, so it's worth a thought!

I'm very concerned about climate change, and look forward to learning about your developments and of any improvements in model capability and code optimisation.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 29126 - Posted: 3 Jun 2007, 1:10:48 UTC
Last modified: 3 Jun 2007, 1:12:02 UTC

The "granularity" is at 40 years. This is what the restart dumps are for.

If you want a preview of new models that are coming soon, you can join the beta testing, or get a vague idea from the post in this page, dated Tue Feb 27, 2007 11:12 pm.
The optimised models are already available as version 5.40 TCMs.

As for getting more credits for your models, dream on. It's not going to happen.


mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 29244 - Posted: 14 Jun 2007, 1:12:07 UTC

Hi Andy

The 10-year or 40-year sections of the model can't just be run separately and then combined. Every model has start conditions, which are the values of its parameters, and these are different for each model. But as the model progresses, the conditions change and the changes are cumulative.

So to run a model from, say, Dec 2000 you need the results up to the end of Nov 2000 (i.e. the restart dump), and you'd need to wait until another computer had completed it up to that point. May as well do it all on one computer.
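
A toy sketch of the point above, assuming a made-up one-line "model" in which each year's state depends on the previous year's. Nothing here is the real CPDN code: `run_segment`, `run_in_segments` and the update rule are all hypothetical. It only illustrates that a later segment has to start from the state the earlier segment left behind, which is what a restart dump carries.

```python
def run_segment(state, years):
    """Advance a toy 'model' by a number of years.

    The update rule is invented for illustration: each year's state is
    computed from the previous year's, so changes accumulate.
    """
    for _ in range(years):
        state = 0.9 * state + 1.0
    return state


def run_in_segments(initial_state, segment_years):
    """Run segments one after another, handing the state along.

    The value passed from one segment to the next plays the role of
    the restart dump.
    """
    state = initial_state
    for years in segment_years:
        state = run_segment(state, years)
    return state


start = 5.0

# One uninterrupted 40-year run versus four chained 10-year segments.
full_run = run_segment(start, 40)
chained = run_in_segments(start, [10, 10, 10, 10])
print("chained segments match the full run:", full_run == chained)

# Starting the second decade from the original start conditions instead of
# from the first decade's end state gives a different answer.
wrong_second_decade = run_segment(start, 10)
right_second_decade = run_in_segments(start, [10, 10])
print("second decade without restart state:", wrong_second_decade)
print("second decade with restart state:   ", right_second_decade)
```

Chaining the segments reproduces the single long run exactly, but only because each segment waits for the one before it, so splitting the work across machines would not let the pieces run at the same time.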

