Optimised I/O 5.44 problem?

Ant B

Joined: 29 Mar 06
Posts: 8
Credit: 2,793,692
RAC: 0
Message 30777 - Posted: 1 Oct 2007, 22:47:43 UTC

So far my crunching has been hassle-free - BBC coupled model and a few slab models (except for one that froze at around 95%). Now, since downloading two 5.44 models, I have been seeing the 'blue screen of death' on my W2K machine. It is the first time I have seen that in over 4 years. This machine has been running steadily since July 07. Messages (on screen, in the memory dump file and in Event Viewer) mentioned boot sector corruption, ntoskrnl corruption, hard drive failure, I/O conflicts and the like. At one point it even said the BIOS was corrupt. This happens only after shutdown - the machine runs without a hitch otherwise. Am I right in suspecting the new model, perhaps? I have suspended the models and if nothing untoward happens in the next week I will post an update, but I would be glad to hear comments or whether anyone else has had similar experiences.
Iain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 30779 - Posted: 2 Oct 2007, 10:47:08 UTC

I had a similar experience running over several months: I thought the USB controller had gone, all sorts of things - all the diagnostics passed fine. Eventually a memory diagnostic failed and I realised it was a duff memory stick. I had 2GB (like you), in two 1GB sticks. With a replacement stick (under warranty - I was amazed!), there are now no BSODs and it's crunching happily.

Version 5.44 has a bigger memory footprint than earlier versions.

If you have two (or more) RAM sticks and they can run unpaired, then try swapping each in and out to see whether one in particular makes a difference.
MikeMarsUK
Volunteer moderator

Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 30780 - Posted: 2 Oct 2007, 12:37:58 UTC


It might be worth running a stress-testing tool such as Memtest86, Prime95 or Orthos for a couple of days.

I'm a volunteer and my views are my own.
Ant B

Joined: 29 Mar 06
Posts: 8
Credit: 2,793,692
RAC: 0
Message 30807 - Posted: 4 Oct 2007, 21:30:13 UTC - in response to Message 30780.  

Thanks for the comments. I am still slowly working things out. It definitely remains a problem of disk corruption at shutdown. I do not doubt that memory faults can cause this. However, it runs for days faultlessly if I don't shut it down. I do not know enough to know whether this is meaningful or not.
So far: Memtest86 has run overnight, 7 times through the cycles, with no indication of a fault. Maybe I should run it for longer.
Another memory stress tool within Windows has run for a day or so - while I let one model run - using up all the spare memory. All OK, no memory faults reported.
I haven't changed the physical configuration of the memory. There was only one chip on board when it first happened - which I used as an excuse to get another. Evidently this did not fix the error, and I have left well alone.
Before I posted earlier, I did try putting the hard drive on a different controller while it was doing its intermittent crashing, because I suspected it might be something odd like a loose plug - it's an IDE drive on an adapter to a SATA port in IDE emulation mode (according to the BIOS). I put it straight on an IDE cable - no better. Changing the controller driver just seemed to make things worse.
So now everything is back to where it was and I am just exiting BOINC before shutdown every time. So far so good. I am wondering about changing the frequency of writing to the hard drive in my preferences settings - any ideas on whether this might make a difference?
DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 30810 - Posted: 5 Oct 2007, 0:00:03 UTC - in response to Message 30807.  

Ant B:

A few important things that everyone seems to have missed so far:
1) CPU temperatures - should be below 60 C for stability;
2) HD temperatures - high temps can cause corruption; I've experienced it!
3) Pagefile usage - high pagefile usage causes excessive disk access during shutdown; do you have other apps running? Did you check for memory leaks (mem usage growing all the time)?
4) Any other changes in the system? Windows Update, etc.? Are you running service pack 4?
5) As something to try last, disable write-behind caching on your HD. See your disk's properties under Device Manager.

If this is too technical, let me know and I'll try to simplify/explain better.
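
To make the memory-leak check in point 3 concrete, here is a rough sketch in Python. It assumes you have jotted down the memory use of a process (in MB, for example from Task Manager) at regular intervals. The function names and the 1 MB-per-sample threshold are illustrative assumptions, not anything BOINC or Windows provides.

```python
def growth_per_sample(samples_mb):
    """Least-squares slope of memory use, in MB per sampling interval."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    x_bar = (n - 1) / 2                      # mean of 0, 1, ..., n-1
    y_bar = sum(samples_mb) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(samples_mb))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def looks_like_leak(samples_mb, threshold_mb_per_sample=1.0):
    """True if usage trends upward faster than the (assumed) threshold."""
    return growth_per_sample(samples_mb) > threshold_mb_per_sample


# Hypothetical readings, noted by hand every few minutes.
steady = [100, 101, 99, 100, 100, 101]
growing = [100, 110, 120, 130]
print(looks_like_leak(steady), looks_like_leak(growing))
```

A fitted slope is used rather than comparing the first and last readings, so a single noisy reading does not decide the result. A steady upward trend over hours is what "mem usage growing all the time" would look like.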
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30812 - Posted: 5 Oct 2007, 4:37:44 UTC

Hi Ant. You said

'I am just exiting BOINC before shutdown every time'

We must all always exit from BOINC before shutting the computer down.

If we repeatedly turn off the computer with BOINC running, sooner or later the model will crash - a specific BOINC error occurs. But I don't think it could cause the problems you mention - it would only crash the model.


Ant B

Joined: 29 Mar 06
Posts: 8
Credit: 2,793,692
RAC: 0
Message 30825 - Posted: 5 Oct 2007, 19:33:20 UTC - in response to Message 30810.  


Thanks everyone for the helpful comments.
As for the hardware issues:

CPU temperatures - Does anyone know a utility that works with my ASRock Conroe 1333 DVI board? I had a utility on my old machine that ran smoothly as a plug-in through mmc. I am not sure whether this one is running hot, but it's stable for long periods at full load, and has been since I put it together in June.

HD temperatures - same problem, with no hardware monitor that I know of. This hard drive is a year or two old and, until a week after the new model, ran flawlessly. I haven't pursued this line of enquiry because there is so little disc activity related to the models compared with other times / apps, and these cause no symptoms.

Pagefile use I know nothing about - where can I check this? I do know that memory use is stable. No leaks according to Task Manager. The models use only 100MB each, and I have nearly 1.5GB unused physical memory even with two models running. Swapfile size is left on automatic. Other apps are running, but just domestic stuff which I have run for ages: IE6, Office, etc. Firewall and antispyware scans - but no antivirus.

System is up to date and not recently changed, apart from another 1GB memory chip after the first crash (which didn't help). SP4 and all the Windows updates installed automatically. Drivers are up to date according to DriverAgent.

I thought of hard drive caching, and found myself frustrated. The option is not visible on the hard drive properties. I thought I was going nuts, but it is present on the spare drive I plugged in to recover from crashes. Is this down to the drive, or perhaps because I am running it on a SATA port through an adapter plug? I'd love an answer to this one.

I'll keep plugging away at it - though if I shut down BOINC before shutting down Windows I have a very strong hunch that I will never see the error again.

Keep the suggestions coming though.
DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 30827 - Posted: 6 Oct 2007, 2:19:05 UTC - in response to Message 30825.  

Ant B wrote:
"I thought of hard drive caching, and found myself frustrated. The option is not visible on the hard drive properties. I thought I was going nuts, but it is present on the spare drive I plugged in to recover from crashes. Is this the drive or perhaps because I am running it on a SATA port through an adapter plug? I'd love an answer to this one.

I'll keep plugging away at it - though if I shut down BOINC before shutting windows I have a very strong hunch that I will never see the error again."


Are you running BOINC as a service or just an application?

I think you've solved your own problem in that regard. I know from the moderators repeating it that BOINC needs to be shut down before Windows OS shutdown is initiated. Give it at least 5 seconds to stop the model(s) before shutting down Windows.

If you're hardcore, try writing a simple shutdown script for BOINC if it's a service. That way your computer will always shut down BOINC. I haven't tested this, so good luck.
http://support.microsoft.com/kb/322241/EN-US/
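
For anyone wanting a starting point, here is a minimal, untested sketch of such a shutdown script in Python. The service name "BOINC" is an assumption (check the Services console for the real name on your machine), and the function names are purely illustrative. `net stop` is the standard Windows command for stopping a service.

```python
import subprocess
import time

SERVICE_NAME = "BOINC"   # assumption: the name the BOINC service is registered under
GRACE_SECONDS = 5        # the "at least 5 seconds" suggested above


def stop_command(service_name=SERVICE_NAME):
    """Build the Windows command that asks a service to stop."""
    return ["net", "stop", service_name]


def stop_boinc(run=subprocess.run, sleep=time.sleep,
               service_name=SERVICE_NAME, grace_seconds=GRACE_SECONDS):
    """Stop the BOINC service, then wait so the model(s) can finish writing.

    `run` and `sleep` are parameters only so the function can be tried out
    without touching a real service.
    """
    result = run(stop_command(service_name), check=False)
    sleep(grace_seconds)
    return result.returncode == 0
```

Calling `stop_boinc()` from a script registered as a shutdown script (as the article linked above describes) would be one way to use it. A one-line batch file containing `net stop` followed by the service name would do the same job without needing Python installed.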

As for drive write caching, only IDE drives and logical RAID drives have that option. SATA is treated as a SCSI interface, so the setting is not supported (disabled). The option is related to the interface/controller, not the drive.

Speedfan can monitor all your motherboard sensors, including HDD (if you're a local administrator).
Ant B

Joined: 29 Mar 06
Posts: 8
Credit: 2,793,692
RAC: 0
Message 30980 - Posted: 16 Oct 2007, 19:56:34 UTC

A quick update on progress so far.

Firstly - BOINC and the model are out of the frame and it does look like hardware. Memory checks still run fine, and swapping out the chips doesn't help - they fail individually and in combination. In fact things seem to be getting worse, and I have begun getting BSoDs in operation now. Memory dumps are varied (the problem is you get the last one, never the first) but I think they do point to the disk controller. Fiddling with the BIOS - turning on memory execute protection and memory compatibility, turning off SpeedStep and various other bits - just makes it worse. I think one model crashed as a result the other day.

So I am uninstalling the new updated drivers for I/O controllers and hard drive controllers and installing two versions back - we will see what happens.

Wish me luck

Anthony
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30983 - Posted: 17 Oct 2007, 1:50:08 UTC

Let us know how you get on.
Ant B

Joined: 29 Mar 06
Posts: 8
Credit: 2,793,692
RAC: 0
Message 31079 - Posted: 23 Oct 2007, 21:16:11 UTC

Success, it seems!

Much rejoicing here since uninstalling the controller drivers and restoring all the BIOS settings to where they were previously. Immediately better. There was one further BSoD the next day, but the reboot was uneventful, and there was only one dump (IRQL_NOT_LESS_OR_EQUAL) pointing to IRQ14 again. So I backdated the IDE controller driver (previously I had done only the SATA driver) and - hey presto. All stable for a week now, and happily crunching. There doesn't appear to be any noticeable decrement in performance either.

The lesson is that the most up-to-date drivers sometimes aren't the best.

So apart from restoring faith in CPDN and my hardware, it saved me having to learn how to use debugging tools - that would have wasted a few evenings...

Thanks everyone.

Anthony
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31090 - Posted: 24 Oct 2007, 2:36:03 UTC

Congratulations - I don't think I'd have known how to fix all that!

©2024 cpdn.org