climateprediction.net home page
CRASHED HADCM3

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45037 - Posted: 5 Oct 2012, 22:50:13 UTC

I seem to have developed a problem with hadcm3 models. I have completed several in the past, but the last 2 have crashed after being stopped to make backups. I don't understand this, as I stopped the manager the right way: first suspending the model and then closing the manager.

The stderr:

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...
Controller:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=0
Model crash detected, will try to restart...


The stderr seems to be saying that an important part of the program did not restart after it was stopped.
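
For reference, the two numbers in the exit code line above are the same value: -1073741819 is the signed 32-bit reading of 0xc0000005, which is the Windows status code for an access violation. A minimal sketch of the conversion follows; the function names are made up for illustration and are not part of BOINC or CPDN.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def format_exit_code(code: int) -> str:
    """Render an exit code the way the stderr header shows it, e.g. 0xc0000005."""
    return f"0x{to_unsigned32(code):08x}"


# The exit code reported in the stderr above.
print(format_exit_code(-1073741819))  # 0xc0000005
```

This only decodes the number; it says nothing about which part of the model triggered the access violation.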

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45038 - Posted: 5 Oct 2012, 22:54:22 UTC - in response to Message 45037.  

Yes, they are touchy little things.
Better luck next time.


Backups: Here
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,736,575
RAC: 3,351
Message 45039 - Posted: 6 Oct 2012, 0:05:20 UTC

Jim,

Perversely, the success rate for HADCM3N models on my machines went up when I stopped making backups. I now run these models completely undisturbed: no suspends, no stops, no backups, no network activity. I make one backup at the start of each model batch, before the models have even unzipped (i.e. they are suspended immediately after downloading). For crashes (other than 'negative theta'), I found by repeated experiments that models would only succeed if restarted from the beginning. Since adopting this method, no model has crashed. When the model has finished, network activity is turned on again for one large upload.

This is unsuitable for most of the machines I could use, since they inevitably have some work use, which takes priority. So, in practice, only one Mac that can be set aside for a three week session is used for HADCM3N. The other machines run HADAM3P when available (machines and models).
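
The single backup at the start of a batch could be scripted roughly as below. This is only a sketch under assumptions: the function name and both paths are hypothetical, and it assumes the BOINC client has been shut down completely before the copy starts, so that no files change while they are being copied.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root.

    Assumes the client is fully stopped so the files are not in use.
    Returns the path of the new backup folder.
    """
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{data_dir.name}-{stamp}"
    shutil.copytree(data_dir, target)  # fails if target already exists
    return target


# Example call (both paths are placeholders, not real defaults):
# backup_data_dir(Path("/path/to/boinc-data"), Path("/path/to/backups"))
```

Restoring would mean copying the folder back with the client stopped; whether a restored model then runs on is exactly what varies in practice.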

Iain
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 45041 - Posted: 7 Oct 2012, 4:26:55 UTC

Thanks for the advice, Iain. I am not sure it would work on my machine. At more than 900 hours to complete a CM model, I don't think I could go all that time without shutting it down at least once. My other machine runs CM just fine; I finished one just yesterday.

When the problem machine has finished running all of the HADAM3P WUs now on it, I plan to reset the project. Maybe that will solve whatever the problem is.


