100 hour bug?

transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 42486 - Posted: 29 Jun 2011, 5:23:34 UTC

These 3 models failed on my computer after running for a bit over 100 hours. The last trickle was for timestep 233280. Coincidence?

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12991505
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12990998
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12990504

They failed with the "invalid theta" message.
ID: 42486
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42487 - Posted: 29 Jun 2011, 5:47:43 UTC - in response to Message 42486.  

This is explained 2 posts before yours, in Hadcm3n INVALID THETA ?. :)


Backups: Here
ID: 42487
Profile JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42488 - Posted: 29 Jun 2011, 6:43:56 UTC

Hadcm3n_s3h8_1940_40_007299462_1 crashed due to “invalid theta” after timestep 233,000 (about 150 hours). These seem to be as unstable as the Famous models were. Does anyone know what percentage of them are failing?

Because the HadCM3n models run for so long, if they are too unstable the attrition rate could be so high that they are not worth running.


ID: 42488
Profile astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 42489 - Posted: 29 Jun 2011, 6:54:03 UTC

The ' s* ' series was/is unstable. New work is from the ' t* ' series and should be okay. (I hope so! I have several of them.)

The fresh lot can be identified in the ID by ' _t***_ ' and we have high hopes for them . . .
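
For anyone who wants to check a list of task names rather than eyeball them, here is a minimal sketch. It assumes the naming pattern seen in this thread (for example Hadcm3n_s3h8_1940_40_007299462_1), where the first character after "hadcm3n_" is the series letter and is followed by three more characters. The pattern and the function names are my own guesses for illustration, not anything published by the project.

```python
import re

# Assumed pattern, inferred only from names quoted in this thread:
# "hadcm3n_" + series letter + three more characters + "_".
WU_NAME = re.compile(r"^hadcm3n_([a-z])[0-9a-z]{3}_", re.IGNORECASE)


def series_letter(workunit_name):
    """Return the series letter in lower case, or None if the name does not match."""
    match = WU_NAME.match(workunit_name)
    return match.group(1).lower() if match else None


def is_unstable_series(workunit_name):
    """True for the 's' series reported as unstable in this thread."""
    return series_letter(workunit_name) == "s"


if __name__ == "__main__":
    print(series_letter("Hadcm3n_s3h8_1940_40_007299462_1"))
```

Names that do not fit the assumed pattern return None instead of a guess, so anything with a different naming scheme is simply left alone.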
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 42489
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42490 - Posted: 29 Jun 2011, 8:09:42 UTC - in response to Message 42489.  
Last modified: 29 Jun 2011, 8:13:53 UTC

The count of available work units on the server status page has kept growing over the last few hours, from a few hundred to 2500 as of now -- hoping these t*** series run to completion.
In any case the thing to do is
keep on crunching
ID: 42490
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42491 - Posted: 29 Jun 2011, 8:27:17 UTC - in response to Message 42490.  

ALL climate models run to completion. It's just that "completion" isn't necessarily the full possible length. :)

It's stated on the RAPIT/RAPID explanation page that some models are expected to fail, because of the extreme forcings and parameter values used.
It all depends on how adventurous the researchers decide to get. :)


Backups: Here
ID: 42491
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 42492 - Posted: 29 Jun 2011, 9:10:37 UTC - in response to Message 42491.  

Very clear -- models that die with "NEGATIVE THETA" or other similar failures are NOT wasted -- the researchers can learn which combinations are consistent with plausible scenarios, and which are not.
Please keep processing whatever models you get -- again --
keep on crunching





ID: 42492
Profile JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42493 - Posted: 29 Jun 2011, 14:45:31 UTC

Thanks for the info. I still have 1 of the “S” series WUs running on my slower machine at about 19%. I was wondering, given the instability of this series, whether it was worth continuing it. Now that I know that even the “failures” yield useful data, I will continue to crunch it as far as it goes.

ID: 42493
Profile Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 42495 - Posted: 29 Jun 2011, 16:58:05 UTC - in response to Message 42493.  


All of the hadcm3n_sXXX_1940_40_ series workunits have been cancelled on the server, so I'd abort it, Jim.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 42495
Profile tullio

Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 42496 - Posted: 29 Jun 2011, 18:05:49 UTC
Last modified: 29 Jun 2011, 18:06:27 UTC

I aborted my "s" WU after 57 hours of running in high priority and got one "t". Let's hope it can get some result. My other 5 projects all share one core, including a Virtual Machine from CERN which does not run in high priority.
Tullio
ID: 42496
Profile JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42498 - Posted: 29 Jun 2011, 21:42:28 UTC - in response to Message 42495.  


Thanks for the advice, Thyme. I have aborted the “S”.

ID: 42498
transient

Joined: 3 Oct 06
Posts: 43
Credit: 8,017,057
RAC: 0
Message 42501 - Posted: 30 Jun 2011, 4:41:51 UTC - in response to Message 42487.  




I just thought it was weird that all 3 failed at the exact same point.
ID: 42501
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42504 - Posted: 30 Jun 2011, 8:28:53 UTC - in response to Message 42501.  

Apart from unintentional failures such as those mentioned a few hours ago in the News thread, models are created in batches, with each one having a slight offset in starting values to the one that preceded it, and to the one that follows it.

If someone gets a bunch of datasets that are of these adjacent values, then if one fails at a certain point, it's possible that others around it in parameter space will fail at a similar point.
Luck of the draw.

As they said during WWII, (or were ready to say): Stay calm and carry on.
Or, as some people say these days: Stay calm and carry yarn.
(That's a knitting joke. :) )


Backups: Here
ID: 42504
