Batch 1008, and test batches 1009 to 1014 for Windows - issues

Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,351,254
RAC: 10,403
Message 70700 - Posted: 3 Apr 2024, 7:14:09 UTC
Last modified: 3 Apr 2024, 7:25:13 UTC

I started 8 tasks yesterday, starting soon after release (2 tasks each on four machines, all Intel i5). Sample task:

wah2_eas25_n2nl_201712_24_1008_012274697_0

The only clue I can see so far is:

Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 7432, iMonCtr = 2
Model crash detected, will try to restart...
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 4880, iMonCtr = 2

All failed after about 8 hours, round about where the first trickle would have been expected.

I'll have another look round in more detail later.

Edit - looks like they didn't send either a credit trickle or a data trickle. But this machine did send an out.zip, which may help.
ID: 70700
Harri Liljeroos

Joined: 9 Dec 05
Posts: 111
Credit: 12,038,780
RAC: 1,393
Message 70701 - Posted: 3 Apr 2024, 7:55:38 UTC

ID: 70701
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,048,302
RAC: 14,831
Message 70702 - Posted: 3 Apr 2024, 7:59:54 UTC

I had 2 tasks crash just shy of the 12-hour mark on the same PC (i7-4790) with the same errors. Seems like the global model is crashing?

https://www.cpdn.org/result.php?resultid=22417487
https://www.cpdn.org/result.php?resultid=22416101
ID: 70702
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70703 - Posted: 3 Apr 2024, 8:38:50 UTC - in response to Message 70702.  
Last modified: 3 Apr 2024, 8:46:39 UTC

I've found this myself and done some preliminary investigation. In the 3 task failures I've had, all were due to the regional model crashing as it tried to run 1/Jan. The forecasts all start from 1/Dec. I'm looking into it.

The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD chip has got 3 tasks past 1/Jan.

I'll be running a failure workunit standalone to debug what's going on. The other two batches have been held pending investigation of possible issues with this one.

p.s. to determine which model has failed, look in the stderr for these lines:
executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252
23:57:52 (252): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 252, selfPID = 10012, iMonCtr = 2
Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 10012, selfPID = 8904, iMonCtr = 1

'Global worker' is the global model and it says it's checking process id = 252. From the executeModelProcess line above it, this process id belongs to the regional model (RCM_PID). If the regional model dies then the global model dies as well. Hence the 'CPDN process is not running, exiting.' The monitor controller process then reports the global model has died and it then dies.
To find out where the model was, navigate to the task folder under 'projects/climateprediction.net' in your BOINC 'data' folder, where you'll find a stdout_mon.txt file with the timesteps listed.
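
For anyone who wants to automate the stderr check described above, here is a minimal Python sketch. It assumes the stderr lines keep exactly the layout quoted in this post; the function name which_model_failed and both regular expressions are illustrative, not part of any CPDN or BOINC tool. It maps the PID on the first boinc_finish line back to the MonID / GCM_PID / RCM_PID values.

```python
import re

# Patterns follow the stderr lines quoted above; the layout is assumed stable.
EXEC_RE = re.compile(r"executeModelProcess: MonID=(\d+), GCM_PID=(\d+), RCM_PID=(\d+)")
FINISH_RE = re.compile(r"\((\d+)\): called boinc_finish\((\d+)\)")


def which_model_failed(stderr_text):
    """Return (role, exit_code) for the first process that called boinc_finish.

    Returns None if the PID line or the boinc_finish line cannot be found.
    """
    exec_match = EXEC_RE.search(stderr_text)
    finish_match = FINISH_RE.search(stderr_text)
    if not exec_match or not finish_match:
        return None
    mon_pid, gcm_pid, rcm_pid = exec_match.groups()
    roles = {
        mon_pid: "monitor",
        gcm_pid: "global model (GCM)",
        rcm_pid: "regional model (RCM)",
    }
    pid, code = finish_match.groups()
    return roles.get(pid, "unknown process"), int(code)


sample = """executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252
23:57:52 (252): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 252, selfPID = 10012, iMonCtr = 2
Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 10012, selfPID = 8904, iMonCtr = 1
"""

print(which_model_failed(sample))
```

The sketch only looks at the first executeModelProcess and boinc_finish lines, so a stderr that contains several restart attempts would need the search extended to each attempt.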
---
CPDN Visiting Scientist
ID: 70703
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,351,254
RAC: 10,403
Message 70704 - Posted: 3 Apr 2024, 9:06:16 UTC - in response to Message 70703.  

A sample from one of mine:

Task 22416552

std_err:
executeModelProcess: MonID=4880, GCM_PID=3548, RCM_PID=4548
04:27:51 (4548): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = Controller4548

stdout_mon.txt:
...
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011616 A - 02/01/2011 00:00 - H:M:S=0008:25:01 AVG= 2.61 DLT= 1.22
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011617 P - 01/01/2011 00:05 - H:M:S=0008:25:07 AVG= 2.61 DLT= 5.41
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011618 P - 01/01/2011 00:10 - H:M:S=0008:25:16 AVG= 2.61 DLT= 9.38
Model crash detected, will try to restart...

Slight garble in std_err, but seems to be the same thing.

My machines are all Intel hardware, mostly running Windows 7 Professional x64 without any emulation layer. My Windows 11 laptop also got two tasks, which are still running - I'll keep an eye on them.
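
To pull the last recorded timestep out of a stdout_mon.txt automatically, something like the following Python sketch might help. The field layout is inferred only from the three sample lines above, so the regular expression is an assumption, and the names last_timestep and elapsed_seconds are made up for illustration.

```python
import re

# Field layout inferred from the sample lines quoted above (an assumption).
STEP_RE = re.compile(
    r"^(?P<task>\S+)\s+-\s+PH\s*(?P<phase>\d+)\s+TS\s*(?P<timestep>\d+)\s+(?P<flag>\S+)"
    r"\s+-\s+(?P<model_date>\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2})"
    r"\s+-\s+H:M:S=(?P<elapsed>\d+:\d{2}:\d{2})"
)


def last_timestep(lines):
    """Return the fields of the last line that looks like a timestep record."""
    last = None
    for line in lines:
        match = STEP_RE.match(line.strip())
        if match:
            last = match.groupdict()
    return last


def elapsed_seconds(elapsed):
    """Convert an H:M:S string such as '0008:25:16' to seconds."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return hours * 3600 + minutes * 60 + seconds


sample = [
    "wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011616 A - 02/01/2011 00:00 - H:M:S=0008:25:01 AVG= 2.61 DLT= 1.22",
    "wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011617 P - 01/01/2011 00:05 - H:M:S=0008:25:07 AVG= 2.61 DLT= 5.41",
    "wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011618 P - 01/01/2011 00:10 - H:M:S=0008:25:16 AVG= 2.61 DLT= 9.38",
    "Model crash detected, will try to restart...",
]

step = last_timestep(sample)
print(step["timestep"], step["model_date"], elapsed_seconds(step["elapsed"]))
```

For a real task, pass an open file object for stdout_mon.txt to last_timestep instead of the sample list; lines that do not match the pattern, such as the crash message, are simply skipped.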
ID: 70704
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70705 - Posted: 3 Apr 2024, 9:17:02 UTC - in response to Message 70704.  

They will all be the same output. There does appear to be some difference in success rate between Intel and AMD, but for now I'm running a failed task standalone to see what's going on in more detail.
---
CPDN Visiting Scientist
ID: 70705
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,351,254
RAC: 10,403
Message 70708 - Posted: 3 Apr 2024, 11:14:54 UTC - in response to Message 70705.  
Last modified: 3 Apr 2024, 11:33:04 UTC

My Windows 11 laptop has now crashed its two tasks as well, at the same place. I've held back the out.zip file for the time being, in case it's any use, but it sounds like the offline debug run will be a better bet.

Edit - in view of the reply, I'll let them go. Holding back on even requesting new tasks until we get the go-ahead.
ID: 70708
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70709 - Posted: 3 Apr 2024, 11:19:08 UTC - in response to Message 70708.  

Thanks Richard, but it won't be of any use. There is not enough information in the returned files to determine the cause. Workunits use different input files to get the forecast spread. It might be related to a problem in one of the files some of the workunits use. First step is to reproduce it locally and we'll go from there.
---
CPDN Visiting Scientist
ID: 70709
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4350
Credit: 16,558,487
RAC: 4,810
Message 70710 - Posted: 3 Apr 2024, 11:30:28 UTC

The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD chip has got 3 tasks past 1/Jan.


Any idea of the percentage of Intel vs AMD chips? I have been trawling and every single failure I have looked at has been Intel, but the overwhelming majority of tasks have not returned a zip yet, so there is no evidence they are running correctly. Mine which have returned zips are all Win10 in a VM as opposed to WINE, which might mask failures. (All on AMD Ryzen 7 3700X.)

I guess we might have more data by tomorrow morning when most computers running 24/7 should have either failed tasks or produced zips.
ID: 70710
Yeti

Joined: 5 Aug 04
Posts: 171
Credit: 10,364,481
RAC: 21,716
Message 70711 - Posted: 3 Apr 2024, 12:06:35 UTC

My tasks have all failed on Intel Xeons of varying generations.


Supporting BOINC, a great concept !
ID: 70711
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 70712 - Posted: 3 Apr 2024, 13:04:48 UTC - in response to Message 70700.  

I got four tasks yesterday, separated by an hour each. The machine is running Windows 10 with an Intel processor.

Computer 1512658
Computer information

CPU type: GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors: 8
Coprocessors: ---
Virtualization: None
Operating System: Microsoft Windows 10 Core x64 Edition, (10.00.19045.00)
BOINC version: 7.24.1
Memory: 15.64 GB
Cache: 256 KB
Swap space: 18.02 GB
Total disk space: 460.73 GB
Free disk space: 366.06 GB
Measured floating point speed: 3.91 billion ops/sec
Measured integer speed: 21.76 billion ops/sec
Average upload rate: 113.53 KB/sec
Average download rate: 7120.42 KB/sec
Average turnaround time: 12.32 days


Tasks were 22418337, 22417523, 22415964 and 22419311.

They all died after running about 12 1/2 hours (45,000 seconds).
ID: 70712
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70713 - Posted: 3 Apr 2024, 14:08:27 UTC - in response to Message 70710.  

Dave, you might recall your dev test did fail and that was on AMD. Without running some analysis on the database I can't give you a good answer. Repeating a failed task standalone reproduces the failure, so I've got something to debug now.

The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD chip has got 3 tasks past 1/Jan.

Any idea of the percentage of Intel vs AMD chips? I have been trawling and every single failure I have looked at has been Intel, but the overwhelming majority of tasks have not returned a zip yet, so there is no evidence they are running correctly. Mine which have returned zips are all Win10 in a VM as opposed to WINE, which might mask failures. (All on AMD Ryzen 7 3700X.)

---
CPDN Visiting Scientist
ID: 70713
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4350
Credit: 16,558,487
RAC: 4,810
Message 70714 - Posted: 3 Apr 2024, 14:16:08 UTC

Dave, you might recall your dev test did fail and that was on AMD.
And that one completed for Richard on an Intel machine
ID: 70714
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70715 - Posted: 3 Apr 2024, 14:24:57 UTC - in response to Message 70714.  
Last modified: 3 Apr 2024, 14:26:33 UTC

Dave, you might recall your dev test did fail and that was on AMD.
And that one completed for Richard on an Intel machine
Yup. But all my Intel-based workunits are failing for 1008 and the only ones working at the minute are on AMD (scratching of head). I don't think it's a particular input file, as they are different between the failed tasks. So for the time being, the focus is on understanding what the code is doing.
---
CPDN Visiting Scientist
ID: 70715
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,351,254
RAC: 10,403
Message 70718 - Posted: 3 Apr 2024, 15:11:48 UTC - in response to Message 70715.  

I second that opinion. Just seen my final two fail - that's 12 out of 12, all on Intel - and it includes the machine that processed the dev site task that failed for Dave.
ID: 70718
rob

Joined: 5 Jun 09
Posts: 79
Credit: 3,043,532
RAC: 3,470
Message 70721 - Posted: 3 Apr 2024, 15:59:51 UTC

In support of Glenn's comment about tasks running on AMD processors getting further along: the first 1008 batch task on my PC running Windows 10 has passed the first trickle, and the rest of my collection are a few hours behind.
More will be revealed in the next few hours, or, hopefully days......
ID: 70721
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,351,254
RAC: 10,403
Message 70723 - Posted: 3 Apr 2024, 16:38:46 UTC

Brainstorming it through with myself, could it be a compiler switch gone rogue? Might you inadvertently be compiling it with optimisations that work on AMD chips only, inserting opcodes that are valid for AMD but aren't available on Intel?
ID: 70723
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70724 - Posted: 3 Apr 2024, 17:13:48 UTC - in response to Message 70723.  

I doubt it. Wrong flags would have been picked up at compile time. It's exactly the same executable used successfully for the 1006 and 1007 batches. But this input data is causing a problem.
Optimization is enabled up to O2 and code dispatch up to SSE 4.2. I'm not an expert on AMD but I believe it also supports SSE 4.2.
I note the models all seem to fail on 1/Jan, which suggests a problem with the input data in some way, maybe related to precision. Optimisation of Fortran 77 code by a modern compiler could be playing a role too. I've got a fun few days ahead :)
---
CPDN Visiting Scientist
ID: 70724
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4350
Credit: 16,558,487
RAC: 4,810
Message 70727 - Posted: 3 Apr 2024, 19:29:49 UTC

Yes, all Ryzen and Threadripper CPUs support SSE4.2 (the older Phenom II only has SSE4a).
ID: 70727
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,048,302
RAC: 14,831
Message 70729 - Posted: 3 Apr 2024, 21:09:17 UTC

All 6 on my Intel PC crashed too, the ones on AMD are humming along. Sounds like a version of the old Y2K problem, switch to a new year - crash. :-D
ID: 70729

©2024 climateprediction.net