Ocean model crashed.

Message boards : Number crunching : Ocean model crashed.

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42852 - Posted: 3 Sep 2011, 13:57:36 UTC

Work Unit hadcm3n_t5wx_1980_40_007414564_0 crashed after timestep 259,200. Reason unknown. The Stderr is shown below.

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 193 (0xc1)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=4760, iMonCtr=1
Model crash detected, will try to restart...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x77163A93 read attempt to address 0x40E476BC

Engaging BOINC Windows Runtime Debugger...

No Process Handle
Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=4312, selfPID=4312, iMonCtr=1


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x76F57353 read attempt to address 0xFFFFFFF8

Engaging BOINC Windows Runtime Debugger...

Cannot serialize file C:\ProgramData\BOINC/projects/climateprediction.net/hadcm3n_t5wx_1980_40_007414564/dataout/shmem_restart.day
Signal 11 received, exiting...
Called boinc_finish

</stderr_txt>
]]>

Can anyone make sense of this? What is an Access Violation? What does Signal 11 mean?

ID: 42852
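
For readers asking the same question, the numbers in the stderr above can be decoded with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and it assumes a Python whose `signal` module defines signal 11 (true on Linux, macOS and Windows builds). On Windows, 0xc0000005 is the status code for an access violation, meaning the program tried to read or write memory it was not allowed to touch. Signal 11 is SIGSEGV (segmentation fault), the C runtime's name for the same kind of fault.

```python
import signal

# Values copied from the stderr output quoted above.
exit_code = 193               # "exit code 193 (0xc1)"
exception_code = 0xC0000005   # "Access Violation (0xc0000005)"
signal_number = 11            # "Signal 11 received, exiting..."

# Look up the symbolic name of the signal number.
signal_name = signal.Signals(signal_number).name

print(hex(exit_code))        # 0xc1
print(hex(exception_code))   # 0xc0000005
print(signal_name)           # SIGSEGV
```

Both messages therefore point at an invalid memory access inside the model process. They do not say what triggered it.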
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42855 - Posted: 3 Sep 2011, 21:14:20 UTC - in response to Message 42852.  

Wiki article on signal 11.

There are a lot of Suspends. Two possibilities:
1) Something, possibly the antivirus, is blocking access to some file at a critical moment.
2) You have the 'newish' BOINC option to stop processing when other program usage is high still set to the default of 25%. Or perhaps one of the several other 'slow down' options is being used.
This could affect the programs, which don't like being interrupted.


ID: 42855
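
To see how many Suspend and Quit requests a task received before it failed, the stderr text can be tallied with a short script. This is a minimal sketch, not a CPDN or BOINC tool: `summarize_stderr` is a hypothetical helper, and the pattern is based only on the monitor lines quoted in this thread.

```python
import re
from collections import Counter

# Matches the monitor lines quoted in this thread, with or without
# the leading "Suspended" prefix.
REQUEST_RE = re.compile(r"CPDN Monitor - (Suspend|Quit) request from BOINC")


def summarize_stderr(text):
    """Count Suspend/Quit requests and collect exception reasons."""
    counts = Counter()
    reasons = []
    for line in text.splitlines():
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
        elif line.startswith("Reason:"):
            reasons.append(line[len("Reason:"):].strip())
    return counts, reasons


# A few lines copied from the stderr in the first post.
sample = """\
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Reason: Access Violation (0xc0000005) at address 0x77163A93 read attempt to address 0x40E476BC
Signal 11 received, exiting...
"""

counts, reasons = summarize_stderr(sample)
print(counts["Suspend"], counts["Quit"], len(reasons))
```

A high Suspend count compared with the length of the run would be consistent with the second possibility above, but it does not prove it.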
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 42856 - Posted: 4 Sep 2011, 4:48:00 UTC - in response to Message 42855.  
Last modified: 4 Sep 2011, 4:53:31 UTC

I doubt that Norton Antivirus is causing the problem. I excluded the BOINC folders in both Programs and ProgramData from scans. I did this for both regular scans and the so-called SONAR component.

edit:
You're right, the "stop work if CPU usage is too high" setting was at 25%. I have reset it to 0.
ID: 42856
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4532
Credit: 18,789,329
RAC: 19,803
Message 42857 - Posted: 4 Sep 2011, 9:53:11 UTC - in response to Message 42784.  

Dave,

I don't know if this is right, but the integer benchmark score is HUGE for that processor. Is it significantly overclocked?


I have only just got round to looking at integer benchmark scores for other computers and I now know just how huge the score is. What a shame that isn't reflected in the speed I get through work units!

Dave
ID: 42857
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42859 - Posted: 5 Sep 2011, 21:02:50 UTC

Jim

I've come up with a new idea about the 'error 193'/25% failures.


ID: 42859
3rkko

Joined: 12 Feb 08
Posts: 66
Credit: 4,877,652
RAC: 0
Message 42919 - Posted: 16 Sep 2011, 18:15:28 UTC

This model, http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=13367975,
crashed at time 0 with an error I have not seen before: "INITTIME: Ocean basis time mismatch".

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
The device does not recognize the command. (0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Model crashed: INITTIME: Ocean basis time mismatch
Model crashed: INITTIME: Ocean basis time mismatch
Model crashed: INITTIME: Ocean basis time mismatch
Model crashed: INITTIME: Ocean basis time mismatch
Model crashed: INITTIME: Ocean basis time mismatch
Model crashed: INITTIME: Ocean basis time mismatch
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>
]]>
ID: 42919
Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 42924 - Posted: 16 Sep 2011, 22:52:03 UTC - in response to Message 42919.  

Hi 3rkko,

I had a couple of these, too.

Some of the yxxx series were not configured correctly, unfortunately.

Fortunately, they crash straight away, so the only loss is the cost of the download, not weeks of CPU time. :)
ID: 42924