hadcm3n restart from backup

Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 43923 - Posted: 3 Mar 2012, 23:32:28 UTC

I'm totally amazed. I restarted a failed hadcm3n from backup (put it on a virtual machine), and it got past the fail point and is still running.
That has never happened before: every one I ever restarted from backup croaked at the fail point, however far back I restored from.


Backups finally did some good.
This one
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=14098560

is actually working: it was restored from backup and got past the fail point.

Totally amazed.

Are we still supposed to keep these restored losers running? It's already reported as failed.
Whee!
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1084
Credit: 7,748,193
RAC: 3,811
Message 43925 - Posted: 4 Mar 2012, 11:09:26 UTC
Last modified: 4 Mar 2012, 14:56:12 UTC

It may not be good news. One of the failure modes is a 'zombie' model, in which the model will keep running until it is restarted: it will not, however, produce any trickle files or Zip uploads - so have a look for a trickle upload when it's run long enough to produce one. If it hasn't produced a file then it's useless. If there are trickles and Zip uploads then, as far as the project is concerned, the model is as good as any other.

[Edit: The model has produced four trickles since the crash, so it looks good.]
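The check described above (has the model written any trickle or Zip files since the restart?) can be sketched roughly as follows. This is only an illustration: the directory to scan and the `*trickle*` and `*.zip` filename patterns are assumptions, not documented CPDN naming, so adjust them to whatever the client actually writes on your machine.

```python
from pathlib import Path


def files_newer_than(directory, cutoff_epoch, patterns=("*trickle*", "*.zip")):
    """Return files under `directory` that match one of `patterns`
    and were modified after `cutoff_epoch` (seconds since the epoch).

    The default patterns are guesses, not confirmed CPDN file names.
    """
    root = Path(directory)
    hits = set()
    for pattern in patterns:
        # rglob searches the directory and all of its subdirectories.
        for path in root.rglob(pattern):
            if path.is_file() and path.stat().st_mtime > cutoff_epoch:
                hits.add(path)
    return sorted(hits)
```

Pass the time of the restart as `cutoff_epoch`. An empty result after the model has run long enough to reach a trickle point would suggest a 'zombie'; a non-empty one only shows that files were written, so the result page on the project site is still the place to confirm that trickles were received.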

Having said that, I have had one success trying to evade a decade crash (at least so far: it may still crash at another decade). That involved restarting from the beginning, before the model had even unzipped. My long-standing habit has been to run batches of models in parallel (i.e. starting at the same time, not interleaved), then to allow a new batch to download and to suspend the new models. When the old batch has finished, a backup is made of the new batch, which will not even have unzipped. The original reason for doing that was that the backups are much smaller, but perhaps it may also be a work-around on this occasion (though massively inefficient for a multi-processor machine).
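The backup step in that routine could be scripted along these lines. It is a minimal sketch under stated assumptions: `backup_directory`, its arguments and the archive naming are hypothetical, the source path is whatever your BOINC data directory happens to be, and it assumes the BOINC client has been stopped first so that files are not changing while they are copied.

```python
import shutil
import time
from pathlib import Path


def backup_directory(source_dir, backup_root, label="boinc"):
    """Zip the contents of `source_dir` into a timestamped archive
    under `backup_root` and return the archive's path.

    Assumes nothing is writing to `source_dir` during the copy and
    that `backup_root` is not located inside `source_dir`.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"not a directory: {source}")
    dest_root = Path(backup_root)
    dest_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = dest_root / f"{label}-{stamp}"
    # make_archive appends the ".zip" suffix and returns the final file name.
    archive = shutil.make_archive(str(base), "zip", root_dir=str(source))
    return Path(archive)
```

Taking the backup while the new batch is still suspended and not yet unzipped keeps the archive small, which was the original reason given above for doing it at that point.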
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 43930 - Posted: 9 Mar 2012, 1:20:41 UTC

So far so good. It computed past the next decade OK.
The restore from backup is good so far, at +75%.
Happy here.

Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 43947 - Posted: 14 Mar 2012, 4:28:17 UTC

It completed OK and uploaded the last big zip file.
The task still shows a computation error on the web page, but the data got uploaded.


astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 43949 - Posted: 14 Mar 2012, 18:33:04 UTC

Good job!

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
