2 recent wah2 crashes - all tasks

MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 54432 - Posted: 6 Jul 2016, 3:47:31 UTC

I've been having a great run over the past months, but I've noticed 2 recent crashes where all 10 running tasks bombed out: 4 Jul 21:15 and 26 Jun 11:57. These are mostly wah2 models; see the errors here (Comp ID 1290283). It's one thing for a model to crash, but to take out all the others is a bit of a feat.

My first thought was computer error, but now I'm not so sure. Looking through the work units, only one of these has gone on to completion on any other PC so far. (I've also found a large number of PCs with very high failure rates, which I'll report in the appropriate place.)

Some of the PCs are showing similar symptoms, in that there are multiple tasks in error at the same time stamp; check 1323410 and 1211978. This seems a bit too regular to be a coincidence. I assume the number of errors at each time depends on the number of tasks being run.
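
One way to check whether errors really do cluster at the same time stamp is to sort the failure times and group those that fall within a short window. A minimal sketch, assuming the failure times have been copied by hand from a host's task list; `simultaneous_failures` is a hypothetical helper (not part of BOINC), and the task names and exact times below are placeholders loosely based on the 4 Jul 21:15 crash.

```python
from datetime import datetime


def simultaneous_failures(failures, window_seconds=60, min_tasks=2):
    """Group (task_name, timestamp) pairs into clusters.

    A cluster holds failures that occur within window_seconds of the
    first failure in that cluster. Clusters smaller than min_tasks are
    dropped, since a lone failure is not a "simultaneous" crash.
    """
    ordered = sorted(failures, key=lambda item: item[1])
    clusters = []
    current = []
    for name, stamp in ordered:
        # Start a new cluster when this failure is too far from the
        # first failure of the cluster being built.
        if current and (stamp - current[0][1]).total_seconds() > window_seconds:
            if len(current) >= min_tasks:
                clusters.append(current)
            current = []
        current.append((name, stamp))
    if len(current) >= min_tasks:
        clusters.append(current)
    return clusters


# Placeholder records: three tasks failing together, one failing alone.
failures = [
    ("task_a", datetime(2016, 7, 4, 21, 15, 2)),
    ("task_b", datetime(2016, 7, 4, 21, 15, 5)),
    ("task_c", datetime(2016, 7, 4, 21, 15, 9)),
    ("task_d", datetime(2016, 6, 30, 8, 0, 0)),
]

clusters = simultaneous_failures(failures)
for cluster in clusters:
    print(cluster[0][1], "-", len(cluster), "tasks failed together")
```

The 60-second window is an arbitrary choice; widening it trades fewer missed clusters for more accidental groupings.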

Again, scanning through the work units, there seem to be a large number of wah2 failures, with only one or two PCs having (mixed) success. Example here

Although I now don't think it's relevant, I've run chkdsk on the BOINC hard drive (no errors reported), but no memory test as yet. I can't really run a soak test until Sat morning.

So the questions are:
1. Is it likely that these are PC-based errors? I suggest not.
2. If yes, I can either run all the current tasks until they complete or explode, or stop processing until after the soak test. Suggestions?
3. Should we be excluding wah2 from the models to run?

All this is very reminiscent of similar issues I had some moons ago. I spent ages trying to fault-find when it turned out to be a model issue.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 54436 - Posted: 6 Jul 2016, 5:29:57 UTC - in response to Message 54432.  

I've only had one failure of WaH2 in ages on my machines.

MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 54437 - Posted: 6 Jul 2016, 6:20:51 UTC - in response to Message 54436.  

Hi Les,
Until these two crashes I could say the same. If I exclude the latest failures, so far this year the PC has crunched 182 tasks successfully and only 6 have had errors, which I reckon is pretty reasonable. Hence the concern.
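
For reference, the figures above work out to an error rate of a little over 3%. The sketch below does that arithmetic and then, under the deliberately crude assumption that each task fails independently with that probability, estimates how unlikely it would be for 10 tasks to fail by chance. The independence assumption and the helper name `failure_rate` are illustrative only; the point is the order of magnitude, not the exact value.

```python
def failure_rate(successes: int, errors: int) -> float:
    """Fraction of finished tasks that ended in error."""
    total = successes + errors
    if total == 0:
        raise ValueError("no finished tasks to compute a rate from")
    return errors / total


# Figures quoted in the post: 182 successes, 6 errors this year.
rate = failure_rate(182, 6)
print(f"error rate: {rate:.1%}")

# Crude assumption: 10 tasks fail independently, each with probability `rate`.
p_all_ten = rate ** 10
print(f"chance of 10 independent failures: {p_all_ten:.1e}")
```

Under that assumption the chance is on the order of 1e-15, which supports the view that a common cause took out all the tasks at once.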
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 54438 - Posted: 6 Jul 2016, 8:11:26 UTC - in response to Message 54432.  

Hi Martin,
Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00463327 read attempt to address 0x010803AC


Could the read attempt being denied be down to some change or upgrade to anti-virus software? Of course, not having used Windows this century I could be talking out of my fundament on this.
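
When comparing stderr output across several failed tasks, it can help to pull the fields out of the exception line mechanically. The sketch below parses the line quoted above. The regular expression is written against that one sample, so whether other CPDN error records share this exact layout is an untested assumption. The code 0xC0000005 is the Windows status value for an access violation (STATUS_ACCESS_VIOLATION).

```python
import re

# Pattern written against the single sample line from this thread.
EXCEPTION_LINE = re.compile(
    r"Reason:\s*(?P<reason>.+?)\s*\((?P<code>0x[0-9a-fA-F]+)\)"
    r"\s*at address\s*(?P<pc>0x[0-9a-fA-F]+)"
    r"\s*(?P<op>read|write)\s+attempt to address\s*(?P<target>0x[0-9a-fA-F]+)"
)


def parse_exception(line: str):
    """Return a dict of parsed fields, or None if the line does not match."""
    match = EXCEPTION_LINE.search(line)
    if match is None:
        return None
    return {
        "reason": match.group("reason"),
        "code": int(match.group("code"), 16),
        "instruction_address": int(match.group("pc"), 16),
        "operation": match.group("op"),
        "target_address": int(match.group("target"), 16),
    }


sample = (
    "Reason: Access Violation (0xc0000005) at address 0x00463327 "
    "read attempt to address 0x010803AC"
)
record = parse_exception(sample)
print(record["reason"], hex(record["code"]), record["operation"])
```

If every failed task on a host reports the same instruction address, that points towards the model executable; differing addresses would fit better with an outside cause such as memory or security software.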

The other thought that occurs to me for multiple tasks being taken out at once is power spikes/outages. Obviously you would know if your own machine had had an outage, but it wouldn't be clear for machines that aren't yours.

Just been looking at the two machines you link to. 1323410 is a Linux box with missing libraries according to the error message.
Art Masson

Joined: 16 Oct 11
Posts: 254
Credit: 15,954,577
RAC: 20
Message 54443 - Posted: 6 Jul 2016, 16:03:26 UTC - in response to Message 54437.  

Hi Martin,

Not sure of any relationship, but I've discovered that I cannot run my Intel i7 CPU at 100% BOINC processing or multiple CPDN jobs fail at the same time (or close to the same time). I don't know the root cause, but I have to run my i7 box allowing only a maximum of 6 CPUs active for BOINC processing. If I run at 100% using all 8 cores, CPDN jobs inevitably fail (but other projects such as SETI, MILKYWAY, EINSTEIN, etc. running simultaneously are not affected). Thinking perhaps this problem might be occurring on your Intel i7 machine. Just a thought; it certainly wouldn't explain the Linux machine failure.
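
For anyone wanting to try the same limit, 6 of 8 CPUs corresponds to a percentage in BOINC's "use at most N% of the CPUs" computing preference. The sketch below only does the arithmetic. The function name is made up for illustration, and the assumption that the preference is stored as `max_ncpus_pct` in `global_prefs_override.xml` should be checked against the BOINC documentation for your client version.

```python
def cpu_percentage(active_cpus: int, total_cpus: int) -> float:
    """Percentage of logical CPUs to allow, given a desired active count."""
    if total_cpus <= 0 or not 0 <= active_cpus <= total_cpus:
        raise ValueError("active_cpus must be between 0 and total_cpus")
    return 100.0 * active_cpus / total_cpus


# The setup described above: 6 CPUs active out of 8.
limit = cpu_percentage(6, 8)

# Assumed element name; verify against your BOINC client's documentation.
print(f"<max_ncpus_pct>{limit:.1f}</max_ncpus_pct>")
```

The same value can be entered through the BOINC Manager's computing preferences dialog instead of editing a file.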

My machine running this way is ID 1266353. If you'd like to compare specs in more detail, please let me know.

Art Masson
St. Charles, IL
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 54444 - Posted: 6 Jul 2016, 23:57:10 UTC - in response to Message 54432.  

I've been having a great run over the past months, but I've noticed 2 recent crashes where all 10 running tasks bombed out: 4 Jul 21:15 and 26 Jun 11:57. These are mostly wah2 models; see the errors here (Comp ID 1290283). It's one thing for a model to crash, but to take out all the others is a bit of a feat.

Do you have enough working memory? I once ran short, and I think it took out three work units at once.

Also, check disk space; the WAH2 work units grow in the disk space they need. By the end of the run (usually about a week for the longest ones for me), they need double the disk space they started with. It might lead to trouble if you don't have enough.
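
A rough way to check this is to compare the space the work units use now against the free space on the drive. The sketch below is only an illustration: `dir_size_bytes` and `has_headroom` are hypothetical helpers, the doubling factor is taken from the observation above rather than from any project documentation, and the data directory has to be supplied by whoever runs it.

```python
import os
import shutil
import sys


def dir_size_bytes(path: str) -> int:
    """Total size of the regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if os.path.isfile(full) and not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; ignore it for this estimate.
                continue
    return total


def has_headroom(current_bytes: int, free_bytes: int, growth_factor: float = 2.0) -> bool:
    """True if free space covers growth from current size to growth_factor times it."""
    extra_needed = current_bytes * (growth_factor - 1.0)
    return free_bytes >= extra_needed


if __name__ == "__main__" and len(sys.argv) > 1:
    data_dir = sys.argv[1]  # path to the BOINC data directory, supplied by the user
    used = dir_size_bytes(data_dir)
    free = shutil.disk_usage(data_dir).free
    print(f"used {used / 2**30:.1f} GiB, free {free / 2**30:.1f} GiB")
    print("enough headroom" if has_headroom(used, free) else "may run short of disk")
```

Treating the current size as the starting size is conservative, because tasks that are part-way through have already done some of their growing.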
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 54445 - Posted: 7 Jul 2016, 8:37:35 UTC - in response to Message 54444.  

I notice this machine http://climateapps2.oerc.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1324422 has some times when a bunch of tasks fail with the same time stamp, but it doesn't have a very high success rate overall anyway. I suspect the cause is that it has 8 cores and less than 8GB of RAM once graphics have taken their share. Shouldn't be an issue on your box however, Martin, with 2GB/core.
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 54448 - Posted: 7 Jul 2016, 22:39:24 UTC - in response to Message 54445.  

Thanks everyone. As Dave noticed, plenty of memory. This is a Xeon workstation with 32 GB ECC RAM, and CPDN has its own 2 TB data hard drive. Art, I noticed your thread and decided this was a separate issue. I run 10 hyperthreaded tasks (out of a possible 16), as I decided long ago that this was the most effective combination; it has never caused an issue in the past. Sometimes I suspend BOINC when I'm doing some particularly intensive work on the PC, but again there is no link with the crash times.
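
To put numbers on the memory point, the sketch below divides installed RAM by the number of tasks running at once, using the figures mentioned in this thread. The helper name and the optional `reserved_gb` parameter are illustrative; the actual per-task memory needs of the models are not stated here, so this only compares machines against each other.

```python
def ram_per_task_gb(total_ram_gb: float, concurrent_tasks: int, reserved_gb: float = 0.0) -> float:
    """RAM available per running task after setting aside reserved_gb."""
    if concurrent_tasks <= 0:
        raise ValueError("concurrent_tasks must be positive")
    return (total_ram_gb - reserved_gb) / concurrent_tasks


# This workstation: 32 GB, 10 tasks running (16 hardware threads available).
print(ram_per_task_gb(32, 10))   # 3.2 GB per running task
print(ram_per_task_gb(32, 16))   # 2.0 GB per thread if all 16 were used

# The struggling host discussed earlier: 8 cores and under 8 GB usable,
# so this figure is an upper bound on what each task gets.
print(ram_per_task_gb(8, 8))     # 1.0 GB per task at most
```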

I'll run a soak test in the weekend, but will be surprised if it throws anything up - running CPDN does a pretty good job as a soak test anyway.

It could be a software update of course, but I always suspend and exit BOINC before doing any work. I have noticed a couple of big Norton updates coming through recently that require the system to be rebooted, but cannot recall if the times were coincident. Norton notifies that a restart is required, but what I don't know is how much installation work it has done in the background before the notification.

Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 54454 - Posted: 8 Jul 2016, 10:53:05 UTC - in response to Message 54448.  

I have noticed a couple of big Norton updates coming through recently that require the system to be rebooted, but cannot recall if the times were coincident. Norton notifies that a restart is required, but what I don't know is how much installation work it has done in the background before the notification.

I think the butler did it. It may not even be related to the updates, but just the normal operation of the AV as it inspects the files. Even the exclusions do not prevent all problems.
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 54471 - Posted: 10 Jul 2016, 7:46:19 UTC
Last modified: 10 Jul 2016, 7:47:55 UTC

Now this is curious. After posting above, I converted my CPDN machine from Windows 7 64-bit to Windows 10 64-bit. There were no CPDN work units running at the time, and it was a clean install; I did not upgrade from Win7 to Win10. This machine runs 3 cores (i5-3550) on CPDN, and the other core supports a GTX 970 on Folding. It is normally very stable, has a UPS on it to protect against power outages, and has plenty of memory and disk space. Also, I had disabled Windows Defender (which takes a registry hack by the way), and there was no other security software. It is a dedicated machine that I monitor remotely over the LAN, so I don't do web browsing, etc. on it.

Only one HadAM3P-HadRM3P Europe v7.28 work unit downloaded initially, and it failed after almost 4 hours. Then three more downloaded over the course of several hours, and they all failed at the same time.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1402819

Whether this is due to something on my machine, or due to the work units themselves I don't know at this point. But the machine is now running only 8.12 Weather At Home 2 (wah2), which are generally more stable than the others.
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 54473 - Posted: 10 Jul 2016, 11:38:05 UTC - in response to Message 54454.  


I think the butler did it.

But was it with the lead pipe? ;-)

Well, I did several stress tests over the weekend. Everything passed with flying colours.

Over the years, Prime95 has come out as the favourite on the CPDN boards, so I started with that. Hmmm, looking at the memory use, it was only using about 6.7 GB. As I have 32 GB of RAM, testing only about a fifth of it doesn't seem like a good idea. So I looked about, decided on Aida64, and ran its stress test for 25 hours. Not only did it seem to use all available memory, but from web chatter it seems to be better suited to modern processors. E.g. see the wiki article here and note the comment from the Asus engineer about the Intel X79 system, which just happens to be mine. OK, it might not push the maths the same way as Prime95, but the PC is used for loads of other work as well, all of which works the system in ways Prime95 doesn't. Just for good measure I ran heaps of other programs like Photoshop, Excel etc. while Aida64 was testing. Nothing seemed to throw it, although things did get a little slow at times - no surprises on that one!

Just to be sure I also did 4 hours with OCCT and another 4 hours with Prime95.

So the upshot is that I'll keep the PC plodding on cranking out the results. I still think something odd is happening though.
