Batch 1008, and test batches 1009 to 1014 for Windows

Author	Message
Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,351,254 RAC: 10,403	Message 70733 - Posted: 4 Apr 2024, 7:54:31 UTC - in response to Message 70727. Quote: "Yes, phenom2, all Ryzen and thread ripper CPUs support SSE4.2" I think I had a vague memory of SSE4a, but that'll be ancient history for current processors. ID: 70733 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,556,002 RAC: 4,603	Message 70734 - Posted: 4 Apr 2024, 13:47:29 UTC Last modified: 4 Apr 2024, 17:40:13 UTC I would suggest that those with Intel processors set CPDN to no new tasks till this is sorted. Edit: It is possible the batch might be closed, which would stop resends and let those with work on AMD machines complete it. Edit: I think it is being paused, which will stop resends. I have looked at over 20 hard fails; every single one is at the same point on an Intel machine. I have seven from the batch on my machine: four have produced 5 zips and trickle-up messages, one has produced four, and two are waiting to start. It is most odd. ID: 70734 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 814 Credit: 13,663,992 RAC: 8,399	Message 70736 - Posted: 4 Apr 2024, 18:07:10 UTC - in response to Message 70734. I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving. Yes, this batch will be stopped from producing resends until we understand why testing did not show this problem. --- CPDN Visiting Scientist ID: 70736 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1067 Credit: 16,546,621 RAC: 2,321	Message 70737 - Posted: 4 Apr 2024, 18:28:14 UTC - in response to Message 70727. My pipsqueak computer, which crashed my latest four CPDN tasks, has a CPU chip with these features:
Computer: 1512658
CPU type: GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors: 8
Coprocessors: ---
Virtualization: None
Operating System: Microsoft Windows 10 Core x64 Edition, (10.00.19045.00)
BOINC version: 7.24.1
Memory: 15.64 GB
Cache: 256 KB
Instruction Set Extensions: Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512
ID: 70737 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,556,002 RAC: 4,603	Message 70738 - Posted: 4 Apr 2024, 20:26:44 UTC Quote: "I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving." Should I just abort the two that are yet to start? I have five others that I can save files from that have all produced either 4 or 5 zips. Or would looking at what happens at the point where they fail on Intel machines be more useful? ID: 70738 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 814 Credit: 13,663,992 RAC: 8,399	Message 70740 - Posted: 4 Apr 2024, 20:54:44 UTC - in response to Message 70738. Last modified: 4 Apr 2024, 20:55:40 UTC Quote: "'I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving.' Should I just abort the two that are yet to start? I have five others that I can save files from that have all produced either 4 or 5 zips. Or would looking at what happens at the point where they fail on Intel machines be more useful?" Hard to answer that, as I'm not the project scientist and it's really their call together with CPDN. Personally, as a developer I have all the kit I need to debug on Intel & AMD, so don't spend time saving files. As a volunteer, if it was me, I'd abort the tasks yet to start and keep running the tasks currently going until told otherwise. They might be useful for comparison later. Sorry Dave, that's the best answer I can give at the moment. --- CPDN Visiting Scientist ID: 70740 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,556,002 RAC: 4,603	Message 70742 - Posted: 5 Apr 2024, 4:57:06 UTC Thanks Glenn. I will abort the two not started yet, as credit isn't an issue for me. ID: 70742 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 814 Credit: 13,663,992 RAC: 8,399	Message 70743 - Posted: 5 Apr 2024, 9:37:31 UTC Last modified: 5 Apr 2024, 9:37:52 UTC There will be a small batch of about 100 workunits going out soon to test whether the issue we're seeing with this batch is related to some of the input files. --- CPDN Visiting Scientist ID: 70743 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,556,002 RAC: 4,603	Message 70744 - Posted: 5 Apr 2024, 9:54:37 UTC - in response to Message 70742. Quote: "Thanks Glenn. I will abort the two not started yet, as credit isn't an issue for me." I was clearly a bit premature with that, as I have picked up one more resend from 1008. ID: 70744 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,351,254 RAC: 10,403	Message 70745 - Posted: 5 Apr 2024, 10:25:35 UTC Last modified: 5 Apr 2024, 10:38:39 UTC I'm restarting work fetch on my 6 Windows machines, on a 10-minute stagger and with a limit of one per machine - that should maximise my chances of being one of the 'select 100'. Edit - and the next one in line got a task. Unfortunately, like Dave's, it's a resend from the previous (failing) run. Glenn, should I keep it, or send it straight back? It's the third copy, so should kill the workunit if I abort it. Workunit 12273481 ID: 70745 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,556,002 RAC: 4,603	Message 70746 - Posted: 5 Apr 2024, 14:13:25 UTC Quote: "I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving." The only reason I haven't asked why is that I almost certainly will not understand the answer! ;) ID: 70746 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,351,254 RAC: 10,403	Message 70747 - Posted: 5 Apr 2024, 14:39:17 UTC Four of my six machines have now got resends from the 2nd April batch, but there's no sign of the test batch yet. I'll keep these out of circulation for the time being, until and unless Glenn can give us a more precise ETA. The trouble is that if our clients get consistent "no tasks" replies from the server, they stop asking (or at least, they ask less frequently). BOINC doesn't really take the needs of this type of test into account. ID: 70747 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,556,002 RAC: 4,603	Message 70748 - Posted: 5 Apr 2024, 15:30:35 UTC Last modified: 5 Apr 2024, 16:01:39 UTC I have deleted the last resend. It was a _2 so won't be sent again now. I have left the five started tasks from 1008 going and there is a resend from 1007 at 88%. I have also set the machine to no new tasks till I get some hints about the imminence of the 100 tasks being released. Edit: I think if BOINC were to cater for this type of test it would almost certainly mess something else up! Edit2: Given the time, I would not be surprised if the test doesn't arrive till Monday, though I have been caught out before by batches being released over the weekend. ID: 70748 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,351,254 RAC: 10,403	Message 70749 - Posted: 5 Apr 2024, 17:02:17 UTC Last modified: 5 Apr 2024, 17:03:05 UTC Starting to get some tasks from batch 1009 - I assume these are the test run. So far, got tasks 22424380 and 22424396. Not seeing them on the server status page yet, but that doesn't update in real time. ID: 70749 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4349 Credit: 16,556,002 RAC: 4,603	Message 70750 - Posted: 5 Apr 2024, 17:47:38 UTC Last modified: 5 Apr 2024, 18:47:24 UTC Quote: "Starting to get some tasks from batch 1009 - I assume these are the test run." I can confirm these are from the test batch of 100 tasks. Edit: And I would guess they have all gone now so I won't get any unless there are failures. ID: 70750 · Reply Quote

bullschuck Send message Joined: 22 May 21 Posts: 37 Credit: 552,494 RAC: 4,025	Message 70752 - Posted: 6 Apr 2024, 2:19:56 UTC - in response to Message 70749. Last modified: 6 Apr 2024, 2:21:14 UTC Quote: "So far, got tasks 22424380 and 22424396." Looks like both of these errored out as well. ID: 70752 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,351,254 RAC: 10,403	Message 70753 - Posted: 6 Apr 2024, 6:46:19 UTC - in response to Message 70752. Quote: "Looks like both of these errored out as well." Yes, and at exactly the same place. The stdout_mon.txt file for 22424396 ends with:
...
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011616 A - 02/01/2010 00:00 - H:M:S=0007:43:16 AVG= 2.39 DLT= 1.15
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011617 P - 01/01/2010 00:05 - H:M:S=0007:43:21 AVG= 2.39 DLT= 5.10
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011618 P - 01/01/2010 00:10 - H:M:S=0007:43:29 AVG= 2.39 DLT= 8.17
Model crash detected, will try to restart...
The other machine is in a different room, and I'll check it later. ID: 70753 · Reply Quote
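
For anyone comparing where several tasks stop, the tail of stdout_mon.txt quoted above can be read mechanically. The following is only a minimal Python sketch: the regular expression is inferred from the three lines quoted in this post, the field names (ts, flag, model_date and so on) are my own labels rather than anything documented by CPDN, and the meaning of the individual fields is an assumption.

```python
import re

# Pattern inferred from the log lines quoted in the post above.
# Field names are illustrative labels, not official CPDN terminology.
LINE_RE = re.compile(
    r"^(?P<name>\S+) - PH\s+(?P<ph>\d+) TS\s+(?P<ts>\d+) (?P<flag>[A-Z]) - "
    r"(?P<model_date>\d{2}/\d{2}/\d{4} \d{2}:\d{2}) - "
    r"H:M:S=(?P<elapsed>\d+:\d{2}:\d{2}) "
    r"AVG=\s*(?P<avg>[\d.]+) DLT=\s*(?P<dlt>[\d.]+)$"
)

CRASH_MARKER = "Model crash detected"


def last_step_before_crash(log_text):
    """Return the fields of the last progress line seen before the first
    crash marker, or None if the log contains no crash marker."""
    last = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith(CRASH_MARKER):
            return last
        match = LINE_RE.match(line)
        if match:
            last = match.groupdict()
    return None
```

Calling last_step_before_crash() on the contents of two failed tasks' logs and comparing the "ts" and "model_date" values is one way to confirm that they stop at the same place.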

Glenn Carver Send message Joined: 29 Oct 17 Posts: 814 Credit: 13,663,992 RAC: 8,399	Message 70756 - Posted: 6 Apr 2024, 11:28:37 UTC I don't want anyone spending any time looking at their failed tasks. Appreciate the response for the small test. There is one clue in the log output (which might be a red herring). The regional model calls boinc_finish with an error code of 193. On Windows that means a bad executable, so I'm looking at the library the model loads dynamically during the run to handle converting the model output. It's possible it's been corrupted in some way. If that's not it, then it's back to the model code. --- CPDN Visiting Scientist ID: 70756 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,351,254 RAC: 10,403	Message 70757 - Posted: 6 Apr 2024, 12:10:40 UTC - in response to Message 70756. Hmmm. I wouldn't be totally sure about that one. A Windows app which fails to start because of a missing DLL usually bombs out with:
- exit code -1073741515 (0xc0000135)
and the generic description is "The application failed to initialize properly". BOINC (and hence the BOINC library which is linked into the app or the wrapper) has its own set of error codes, which you can find at: https://github.com/BOINC/boinc/blob/master/lib/error_numbers.h They include both positive and negative values, so I'd suspect both of these, in addition to the MS Windows numbers:
#define EXIT_SIGNAL 193 // app was killed by signal
#define ERR_INVALID_EVENT -193
That doesn't get us much further forward, but I'm still in brainstorming mode. ID: 70757 · Reply Quote
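
The two spellings of the missing-DLL exit code above are the same 32-bit value read as signed or unsigned, and the bare number 193 means different things depending on whose table it is looked up in. A minimal Python sketch of both points follows; the helper names and the CANDIDATES table are illustrative only and are not part of BOINC or Windows. The table lists just the readings raised in this thread.

```python
def to_unsigned32(code):
    """Reinterpret a (possibly negative) exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


# Candidate readings of 193 / -193 mentioned in this thread,
# keyed by (namespace, numeric value). Not an exhaustive list.
CANDIDATES = {
    ("windows", 193): "ERROR_BAD_EXE_FORMAT (bad executable)",
    ("boinc", 193): "EXIT_SIGNAL (app was killed by signal)",
    ("boinc", -193): "ERR_INVALID_EVENT",
}


def readings(code):
    """Return every candidate reading whose numeric value equals code."""
    return [f"{ns}: {name}" for (ns, value), name in CANDIDATES.items() if value == code]
```

For example, hex(to_unsigned32(-1073741515)) gives the 0xc0000135 form quoted above, and readings(193) returns both the Windows and the BOINC interpretation, which is why the number alone does not settle where the failure comes from.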

Glenn Carver Send message Joined: 29 Oct 17 Posts: 814 Credit: 13,663,992 RAC: 8,399	Message 70758 - Posted: 6 Apr 2024, 15:53:02 UTC - in response to Message 70757. Last modified: 6 Apr 2024, 15:55:55 UTC Hi Richard, the model only loads the external library when it needs to convert the model raw output ready for sending. That doesn't happen at model start, but at fixed points in the forecast. So the model will start fine and load the library after some time. That is a possible explanation for why they all fail on 1 Jan. The boinc_finish error code is whatever value was passed to it. It could come from the return/errno value of the LoadLibrary() call, or it might come from a Fortran operation. I'm still looking for the exact point of failure in the code. (https://stackoverflow.com/questions/38579909/loadlibrary-fails-with-error-code-193) --- CPDN Visiting Scientist ID: 70758 · Reply Quote
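
Windows system error 193 is ERROR_BAD_EXE_FORMAT, and with LoadLibrary() a common cause is a 32-bit/64-bit mismatch between the process and the DLL, or a file that is no longer a valid PE image. Whether that applies here is not established in this thread. As a hedged illustration only, the Python sketch below reads the machine field from a PE header so a library file can be checked offline; the function name is my own, and the mapping covers just three well-known machine values.

```python
import struct

# A few well-known IMAGE_FILE_MACHINE_* values from the PE header.
MACHINE_NAMES = {
    0x014C: "x86 (32-bit)",
    0x8664: "x64 (64-bit)",
    0xAA64: "ARM64",
}


def pe_machine(data):
    """Return a description of the target machine of a PE image.

    data is the raw bytes of an .exe or .dll. Raises ValueError if the
    bytes do not look like a PE image, which is itself a sign that the
    file is truncated or corrupted.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # Offset 0x3C of the DOS header holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    if len(data) < pe_offset + 6:
        raise ValueError("truncated PE header")
    # The 16-bit machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04x})")
```

Given the bytes of the conversion library (for example via pathlib.Path(path).read_bytes(), where path is wherever the file sits in the project or slot directory), a result that does not match the model executable, or a ValueError, would point towards the bad-executable reading of 193.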

Batch 1008, and test batches 1009 to 1014 for Windows - issues