something happened to my computer, all models crashed

NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41068 - Posted: 17 Nov 2010, 20:27:21 UTC
Last modified: 17 Nov 2010, 20:29:16 UTC

On my computer (hostid 1109774), all three models that were running (it's a triple core) crashed within a minute of each other. When the computer then tried to download new workunits, there were some download errors.

I noticed this after several hours when I came back and tried to use the computer.

Also, something was wrong with the computer itself, as I could not load Firefox. It would just use more and more RAM until the process was using over 2 GB.

I rebooted the computer, and now it appears to run fine.

I was wondering if the exit statuses on the tasks can tell me what happened?

Also, I ended up resetting the project just in case.
ID: 41068
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 41070 - Posted: 17 Nov 2010, 22:54:16 UTC - in response to Message 41068.  

I was wondering if the exit statuses on the tasks can tell me what happened?

Two of the tasks (famous_s3eq_799_200_006670973_2 and hadcm3igeo_w179_2000_80_06761500_1) failed with maximum elapsed time exceeded. The run time for both was over 1,290,000,000 seconds; roughly 41 years!
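As a rough check on the scale of that figure, the conversion can be sketched as below. The Julian year of 365.25 days is an assumption made for the estimate, and the input is the lower bound quoted above rather than the exact run time, which the thread does not give.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: Julian year, good enough for an estimate


def seconds_to_years(seconds: float) -> float:
    """Convert a duration in seconds to years."""
    return seconds / (SECONDS_PER_DAY * DAYS_PER_YEAR)


reported_run_time = 1_290_000_000  # lower bound quoted in the post
years = seconds_to_years(reported_run_time)
print(f"{reported_run_time:,} s is about {years:.1f} years")
```

Either way, the reported run time is decades longer than the computer could possibly have been crunching.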

The elapsed time is calculated in a timer thread which should trigger every 0.1 seconds. The only way the run time can get as large as it was on those tasks is if something has caused the sleep to start returning immediately.
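To illustrate the failure mode described here, the following is a minimal sketch and not the CPDN or BOINC source, which this thread does not show. It assumes the counter adds a fixed 0.1 s on every wake-up instead of reading a clock; `run_timer` and `broken_sleep` are hypothetical names.

```python
import time
from typing import Callable

TICK_SECONDS = 0.1  # nominal timer interval mentioned in the post


def run_timer(ticks: int, sleep: Callable[[float], None]) -> float:
    """Accumulate elapsed time by adding one interval per wake-up."""
    elapsed = 0.0
    for _ in range(ticks):
        sleep(TICK_SECONDS)      # a healthy sleep blocks for about 0.1 s
        elapsed += TICK_SECONDS  # the counter assumes the full interval passed
    return elapsed


def broken_sleep(_seconds: float) -> None:
    """Stand-in for a sleep call that returns immediately."""
    return None


start = time.monotonic()
counted = run_timer(1_000_000, broken_sleep)
wall = time.monotonic() - start
print(f"counted {counted:,.0f} s in {wall:.2f} s of wall-clock time")
```

With a sleep that never blocks, the counted time runs far ahead of the wall clock, so a task can hit its maximum elapsed time limit almost at once.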
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 41070
NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41083 - Posted: 19 Nov 2010, 7:19:15 UTC - in response to Message 41070.  
Last modified: 19 Nov 2010, 7:20:15 UTC

I was wondering if the exit statuses on the tasks can tell me what happened?

Two of the tasks (famous_s3eq_799_200_006670973_2 and hadcm3igeo_w179_2000_80_06761500_1) failed with maximum elapsed time exceeded. The run time for both was over 1,290,000,000 seconds; roughly 41 years!

The elapsed time is calculated in a timer thread which should trigger every 0.1 seconds. The only way the run time can get as large as it was on those tasks is if something has caused the sleep to start returning immediately.


So what could cause that? I am running a fairly aggressive overclock, but I checked the computer right after by running Prime95 for six hours, and it was stable.
ID: 41083
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,363,583
RAC: 5,022
Message 41084 - Posted: 19 Nov 2010, 7:35:26 UTC - in response to Message 41083.  

Dear NewtonianRefractor

Running Prime95 for only six hours may not be long enough to get a true picture of stability on an overclocked computer. Twenty-four hours would be a better test.

ID: 41084
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41085 - Posted: 19 Nov 2010, 8:26:24 UTC - in response to Message 41083.  

a fairly aggressive overclock



Say no more.

ID: 41085
NewtonianRefractor

Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 41087 - Posted: 19 Nov 2010, 10:10:09 UTC

Alright, I decided to shut down BOINC and run Prime95 over the weekend. I guess I will run it in blend mode so that there is some load on the RAM as well as on the CPU.

My overclock is from 3.1 GHz to 3.7 GHz with a voltage of 1.5V.

At stock, the CPU had a voltage of 1.32V. I could only overclock the CPU to 3.4 GHz without raising the voltage.
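To put those numbers in perspective, here is a small sketch of the arithmetic. The frequency and voltage figures are the ones given above. The power estimate is an assumption that is not from this thread: it uses the common rule of thumb that dynamic power scales with frequency times voltage squared, and it ignores leakage.

```python
stock_ghz, oc_ghz = 3.1, 3.7  # frequencies from the post
stock_v, oc_v = 1.32, 1.5     # core voltages from the post

freq_gain = oc_ghz / stock_ghz - 1
volt_gain = oc_v / stock_v - 1

# Assumption: dynamic power ~ frequency * voltage^2 (rule of thumb only).
power_ratio = (oc_ghz / stock_ghz) * (oc_v / stock_v) ** 2

print(f"frequency: +{freq_gain:.0%}, voltage: +{volt_gain:.0%}")
print(f"estimated dynamic power: {power_ratio:.2f}x stock")
```

Under that assumption, the chip is dissipating roughly one and a half times its stock dynamic power.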

I have been running the CPU at these elevated frequencies for over a month now, and the computer appeared stable, with no problems before this. I did run Prime95 for about a day when I was initially checking my overclock, but I will recheck it now.
ID: 41087
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41130 - Posted: 21 Nov 2010, 23:40:42 UTC
Last modified: 21 Nov 2010, 23:48:31 UTC

I've looked at the stderr output for both of the crashed models.

hadcm3igeo_w179_2000_80_06761500_1 produced lockfile errors. As far as I know, models can't survive this. It's a very unusual type of error and shouldn't occur with recent versions of BOINC. If it were caused by your BOINC version, every one of your models would crash with the same error, but this isn't happening.

famous_s3eq_799_200_006670973_2 produced INVALID THETA messages twice. If this INVALID THETA occurs 5 or 6 times together it usually indicates a problem inherent in the model which leads to values that are out-of-normal-range. I think your computer generated the out-of-range value.
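If you want to check a stderr log for this yourself, a minimal sketch follows. Only the message text INVALID THETA comes from the task output discussed here. The sample lines and the `longest_run` helper are invented for illustration, and the real stderr layout may differ.

```python
from typing import Iterable

MARKER = "INVALID THETA"  # message text quoted in the post


def longest_run(lines: Iterable[str], marker: str = MARKER) -> int:
    """Return the longest streak of consecutive lines containing marker."""
    longest = current = 0
    for line in lines:
        if marker in line:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical stderr excerpt, invented for illustration only.
sample_stderr = [
    "model step started",
    "INVALID THETA",
    "INVALID THETA",
    "model step started",
]

total = sum(MARKER in line for line in sample_stderr)
run = longest_run(sample_stderr)
print(f"{total} occurrences, longest consecutive run: {run}")
```

A run of two, as in this made-up sample, is well short of the five or six together that would point at the model itself.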
Cpdn news
ID: 41130
Belfry

Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 41136 - Posted: 22 Nov 2010, 15:09:49 UTC
Last modified: 22 Nov 2010, 15:38:12 UTC

Prime95 is not the last word on stability. x86 CPUs contain many different hardware pathways and registers to handle the myriad x86 instructions. Since Prime95 runs a relatively simple algorithm over and over, it simply can't test them all. Fourteen months ago I undervolted my Phenom II from 1.30 to 1.20 volts. At that voltage it could run Mprime (the Linux version of Prime95) perfectly for 24 hours, but hadsm3 models would crash sooner or later. Yet four hadam3p models would run to completion. The key difference is that hadam3p uses SSE2 instructions exclusively for crunching, whereas hadsm3 uses x87 under Linux (under Windows it's a mix of SSE2 and x87, based on my investigations into hadsm3 "iceworlds"). I'm guessing famous and hadcm3 models also contain quite a bit of x87 code.

Nowadays I don't consider a machine stable until it's run three different CPDN models to completion at full load.
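The general idea behind this kind of testing can be sketched as a repeat-and-compare check: run a deterministic workload several times and flag any disagreement between runs. This is only an illustration with made-up workloads. Python cannot choose between x87 and SSE2 code paths, so it does not reproduce the distinction above, and it is no substitute for running real models.

```python
import hashlib
import math
import struct


def float_workload(n: int) -> bytes:
    """Deterministic floating-point loop; returns the result as raw bytes."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += math.sin(i) * math.sqrt(i) / (1.0 + math.log(i))
    return struct.pack("<d", acc)


def integer_workload(n: int) -> bytes:
    """Deterministic 64-bit integer loop; returns the result as raw bytes."""
    acc = 1
    for i in range(1, n + 1):
        acc = (acc * 6364136223846793005 + i) % (1 << 64)
    return acc.to_bytes(8, "little")


# Each workload exercises only its own kind of arithmetic.
WORKLOADS = {"float": float_workload, "integer": integer_workload}


def check_stability(repeats: int = 5, n: int = 20_000) -> list:
    """Return the names of workloads whose repeated runs disagreed."""
    unstable = []
    for name, work in WORKLOADS.items():
        digests = {hashlib.sha256(work(n)).hexdigest() for _ in range(repeats)}
        if len(digests) > 1:  # identical inputs should give identical bits
            unstable.append(name)
    return unstable


print(check_stability())
```

A clean result only covers the operations each workload happens to use, which is the same limitation described above for Prime95.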

Edit: it took quite a while to trace my crashes to the x87 code in hadsm3, because all running models would crash shortly afterwards, including those from other BOINC projects.
ID: 41136
