Batch 1015 Discussion/problems

Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 70822 - Posted: 15 Apr 2024, 12:15:38 UTC - in response to Message 70821.  

Batch 1015 is being released now. This is the next batch in the East Asia 25km configuration (eas25).


I just got one of these on my little machine (Computer 1512658). It has a little over half an hour on it now, so it did not crash on start-up. It predicts about 16 days to go.
ID: 70822
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4350
Credit: 16,558,487
RAC: 4,810
Message 70823 - Posted: 15 Apr 2024, 14:19:17 UTC
Last modified: 15 Apr 2024, 14:23:41 UTC

Six downloaded here. (Placeholder for this batch)
I assume this means you have tracked down the problem, Glenn?
ID: 70823
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 70824 - Posted: 15 Apr 2024, 14:26:59 UTC - in response to Message 70823.  

I now have two on my pipsqueak Windows 10 machine (ID: 1512658). They both seem to be running OK. They are predicted to take about 16 days, but by eyeball it looks like they will be a little faster than that.
ID: 70824
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,048,302
RAC: 14,831
Message 70825 - Posted: 15 Apr 2024, 19:35:46 UTC

It seems that the task directory and files that should go into the slots directory still go into the projects/climateprediction.net directory. When I ran out of work a couple of days ago I cleaned out all of the older ones, but when I got new work today, new ones appeared.
ID: 70825
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70827 - Posted: 15 Apr 2024, 20:19:26 UTC - in response to Message 70825.  
Last modified: 15 Apr 2024, 20:19:56 UTC

It seems that the task directory and files that should go into the slots directory still go into the projects/climateprediction.net directory. When I ran out of work a couple of days ago I cleaned out all of the older ones, but when I got new work today, new ones appeared.
This is changed in the next release. To keep results consistent within a running project, we keep the version the same for all batches in that project.
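
For anyone who wants to see which task directories are sitting in the project folder before cleaning up by hand, here is a minimal sketch. It assumes the leftover directories are named after the workunit (as in the paths that appear in stderr output, e.g. wah2_eas25_..._012278684) and that you supply the list of workunits still in progress yourself. The function name and its parameters are made up for illustration; it only lists directories and deletes nothing.

```python
from pathlib import Path


def find_leftover_task_dirs(project_dir, active_workunits):
    """List wah2_* directories in project_dir that match no active workunit.

    project_dir      -- e.g. the projects/climateprediction.net folder
    active_workunits -- workunit names still in progress (assumed to be
                        supplied by the user; nothing is queried from BOINC)
    """
    active = set(active_workunits)
    leftovers = []
    for entry in sorted(Path(project_dir).iterdir()):
        # Only directories are considered, so the model executables and
        # zip files that live in the same folder are never reported.
        if entry.is_dir() and entry.name.startswith("wah2_") and entry.name not in active:
            leftovers.append(entry.name)
    return leftovers
```

Review the returned names before removing anything, and never touch a directory that belongs to a task that is still running.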
---
CPDN Visiting Scientist
ID: 70827
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 70828 - Posted: 16 Apr 2024, 1:44:07 UTC - in response to Message 70824.  
Last modified: 16 Apr 2024, 1:44:56 UTC

One of the two 1015 tasks on my pipsqueak machine has accomplished its first trickle. Another potential obstacle overcome.
ID: 70828
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,351,254
RAC: 10,403
Message 70830 - Posted: 16 Apr 2024, 14:01:28 UTC

Slightly off topic, but related to current issues.

I've been sent a batch 1007 resend:

wah2_eas25_a1cu_199312_24_1007_012266614_1

The previous user had an Intel i9, but only managed three trickles in a fortnight. My i5 will probably run through it in 8 - 9 days, but is it worth it?
ID: 70830
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70831 - Posted: 16 Apr 2024, 14:07:33 UTC - in response to Message 70830.  
Last modified: 16 Apr 2024, 14:09:07 UTC

Yes, definitely. Batch 1007 is a valid batch. Don't abort it!

1006 & 1007 might be hitting the deadline for volunteers who have not yet started tasks. That might be why resends are coming.
---
CPDN Visiting Scientist
ID: 70831
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 70835 - Posted: 17 Apr 2024, 17:18:24 UTC - in response to Message 70828.  

One of the two 1015 tasks on my pipsqueak machine has accomplished its first trickle. Another potential obstacle overcome.


And now each of those two tasks has delivered four trickles.
ID: 70835
David Berg

Joined: 2 Jul 15
Posts: 18
Credit: 3,926,000
RAC: 1,631
Message 70837 - Posted: 17 Apr 2024, 21:29:08 UTC

I had a Batch 1015 task abort overnight. Only other CPDN tasks would have been running at the time. Everything else was quiesced.

Name wah2_eas25_a3id_201912_24_1015_012281245_0
Workunit 12281245
Created 15 Apr 2024, 10:43:22 UTC
Sent 15 Apr 2024, 14:43:43 UTC
Report deadline 24 Jul 2024, 14:43:43 UTC
Received 17 Apr 2024, 8:27:45 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 0 (0x00000000)
Computer ID 1367467
Run time 1 days 5 hours 52 min 58 sec
CPU time 1 days 5 hours 52 min 58 sec
Validate state Invalid
Credit 1,678.16
Device peak FLOPS 3.48 GFLOPS
Application version Weather At Home 2 (wah2) (region independent) v8.29 windows_intelx86
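
The task name carries several of the fields listed above, which helps when sorting failures by batch. Here is a rough sketch of splitting it up. The batch (1015) and workunit (12281245) positions are confirmed by the details above, and eas25 is the East Asia 25km region. The other field names (run_id, start_yyyymm, length, attempt) are my own guesses at labels, not official CPDN terminology.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    app: str           # "wah2"
    region: str        # "eas25" = East Asia 25km configuration
    run_id: str        # hypothetical label for the 4-character field
    start_yyyymm: str  # hypothetical label; looks like a year and month
    length: str        # meaning not stated in the thread; kept as text
    batch: int
    workunit: int
    attempt: int       # hypothetical label for the trailing _0 / _1


def parse_task_name(name):
    """Split a task name such as wah2_eas25_a3id_201912_24_1015_012281245_0."""
    parts = name.split("_")
    if len(parts) != 8:
        raise ValueError(f"unexpected task name layout: {name!r}")
    app, region, run_id, start, length, batch, workunit, attempt = parts
    return TaskName(app, region, run_id, start, length,
                    int(batch), int(workunit), int(attempt))
```

For the failed task above this gives batch 1015 and workunit 12281245, matching the values shown on the task page.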

Another Batch 1015 task is still running ... as is a Batch 1005 task that started mysteriously last week running version 8.24.
ID: 70837
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70838 - Posted: 18 Apr 2024, 9:35:32 UTC - in response to Message 70837.  
Last modified: 18 Apr 2024, 9:35:40 UTC

I had a look; there's no 'stderr' output (the task log) on the task webpage, so I can't see why the model failed. The fact that there is no stderr output at all is itself a clue, though.

I notice the PC only has 8 GB of RAM. How much RAM do you allocate to BOINC? And how many CPDN tasks do you have running at a time? I suspect a problem with available memory.

Also, memory can get fragmented on Windows (similar to disk fragmentation). It's not impossible the task died because it couldn't allocate a memory segment big enough. The best way to clear memory fragmentation is to reboot the machine.

That's all I can help with on this.
---
CPDN Visiting Scientist
ID: 70838
David Berg

Joined: 2 Jul 15
Posts: 18
Credit: 3,926,000
RAC: 1,631
Message 70841 - Posted: 18 Apr 2024, 21:20:57 UTC - in response to Message 70838.  

Thank you, Glenn. I've tended to allow the computer to run steadily while CPDN projects are active. I will change my paradigm to begin rebooting at the end of every day, quiescing the CPDN tasks before doing so and then activating them afterward. Following is my profile; I welcome suggestions to improve it. Thank you.

When computer is in use
'In use' means mouse/keyboard input in last 5 minutes
Suspend all computing No
Suspend GPU computing No
Use at most 75 % of the CPUs
Use at most 50 % of CPU time
Suspend when non-BOINC CPU usage is above 50 %
Use at most 50 % of memory
When computer is not in use
Use at most 75 % of the CPUs (requires BOINC 7.20.3+)
Use at most 75 % of CPU time (requires BOINC 7.20.3+)
Suspend when non-BOINC CPU usage is above 75 % (requires BOINC 7.20.3+)
Use at most 75 % of memory
Suspend when no mouse/keyboard input in last --- minutes
General
Suspend when computer is on battery N/A
Switch between tasks every 60 minutes
Request tasks to checkpoint at most every 60 seconds
Leave non-GPU tasks in memory while suspended Yes
Store at least --- days of work
Store up to an additional 0.25 days of work
Compute only between ---
Disk
Use no more than 250 GB
Leave at least 0.001 GB free
Use no more than 50 % of total
Page/swap file: use at most 75 %
Network
Limit download rate to --- KB/second
Limit upload rate to --- KB/second
Limit usage to --- MB every --- days
Transfer files only between ---
Skip data verification for image files
Confirm before connecting to Internet
Disconnect when done
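
As a rough check on the memory question, the two "Use at most ... % of memory" limits in the profile above can be turned into a RAM budget for an 8 GB machine. This is only a sketch: the per-task memory figure is a placeholder you would have to measure yourself (the thread does not state how much a wah2 task needs), and both function names are made up for illustration.

```python
def boinc_memory_budget_gb(total_ram_gb, percent_allowed):
    """RAM BOINC may use under a 'Use at most N % of memory' setting."""
    return total_ram_gb * percent_allowed / 100.0


def max_tasks_by_memory(total_ram_gb, percent_allowed, per_task_gb):
    """How many tasks fit in the budget, given an assumed per-task footprint."""
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    budget = boinc_memory_budget_gb(total_ram_gb, percent_allowed)
    return int(budget // per_task_gb)


if __name__ == "__main__":
    # 8 GB machine with the settings from the profile: 50 % in use, 75 % idle.
    print(boinc_memory_budget_gb(8, 50))  # 4.0 GB while the computer is in use
    print(boinc_memory_budget_gb(8, 75))  # 6.0 GB while it is idle
    # PLACEHOLDER footprint of 1.0 GB per task, not a measured wah2 value.
    print(max_tasks_by_memory(8, 50, 1.0))
```

The budget drops from 6 GB to 4 GB as soon as the keyboard or mouse is touched, so a set of tasks that fits while the machine is idle may not fit once it is in use.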
ID: 70841
biodoc

Joined: 2 Oct 19
Posts: 21
Credit: 46,489,334
RAC: 18,462
Message 70845 - Posted: 19 Apr 2024, 15:10:18 UTC
Last modified: 19 Apr 2024, 15:11:09 UTC

I picked up 32 of these batch 1015 tasks on a 5950X with 64 GB RAM. One failed after the 2nd trickle. The others seem to be doing fine so far.

Here's a link to the one that failed: link to task

Stderr output:
<core_client_version>7.22.2</core_client_version>
<![CDATA[
<message>
The system cannot find the drive specified.
 (0xf) - exit code 15 (0xf)</message>
<stderr_txt>
modelGetExecutables: check control files, strTemp0 & 1 : 
C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_a1j8_201312_24_1015_012278684/jobs/xadae.namelists
C:\ProgramData\BOINC/projects/climateprediction.net/wah2_eas25_a1j8_201312_24_1015_012278684/jobs/xacxf.namelists
modelGetExecutables: unzipping control files : strInput & strTmp 
wah2_eas25_a1j8_201312_24_1015_012278684.zip
wah2_eas25_a1j8_201312_24_1015_012278684/jobs
gstrDump[0] = generic_phase1_spinup_eas25_global_aabaka
gstrDump[1] = generic_phase1_spinup_eas25_regional_aabaka
global model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.29_windows_intelx86.exe" wah2_eas25_a1j8_201312_24_1015_012278684 generic_phase1_spinup_eas25_global_aabaka ic19610812_12_N96 AERclim_ancil_168months_CMIP6-MIROC6_SST_2009-01-01_2022-12-30_v2404 AERclim_ancil_168months_CMIP6-MIROC6_SIC_2009-01-01_2022-12-30_v2404 SO2DMS_N96_cmip6hist-ssp245_2009-2020 oxi.addfa ozone_cmip6hist-ssp245_N96_1979_2031
regional model: command string: "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_a1j8_201312_24_1015_012278684
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
executeModelProcess: MonID=4856, GCM_PID=19780, RCM_PID=1608
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 19780, selfPID = 19780, iMonCtr = 1

</stderr_txt>
ID: 70845
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 70848 - Posted: 19 Apr 2024, 18:42:17 UTC - in response to Message 70835.  

And now each of those two tasks has delivered four trickles.


And now eight trickles.
ID: 70848
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70849 - Posted: 19 Apr 2024, 23:36:09 UTC - in response to Message 70845.  

The error message "The system cannot find the drive specified." is a Windows issue. It has a number of possible causes, such as a disk timeout or a failing drive. It might be worth doing a SMART check on the drive concerned if it keeps happening.
This error accounts for about 10% of CPDN WAH task failures.
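
For anyone collecting these, the message and exit code can be pulled out of the stderr text that the task page shows. Here is a minimal sketch, assuming the `<message>` block always has the layout seen in the failed 1015 task (message text, then "(0xf) - exit code 15 (0xf)"). The function name is made up for illustration. Exit code 15 is the Win32 system error code ERROR_INVALID_DRIVE, whose standard text is exactly this message.

```python
import re

# Matches: <message> text (0xNN) - exit code N (0xNN)</message>
MESSAGE_RE = re.compile(
    r"<message>\s*(.*?)\s*\(0x[0-9a-fA-F]+\) - exit code (\d+) "
    r"\(0x[0-9a-fA-F]+\)</message>",
    re.DOTALL,
)


def parse_failure(stderr_text):
    """Return (message, exit_code) from a task's stderr, or None if absent."""
    match = MESSAGE_RE.search(stderr_text)
    if match is None:
        return None
    return match.group(1).strip(), int(match.group(2))
```

A task with no stderr output at all simply returns None, which is a different kind of failure from one that reports a Windows error code.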
---
CPDN Visiting Scientist
ID: 70849
biodoc

Joined: 2 Oct 19
Posts: 21
Credit: 46,489,334
RAC: 18,462
Message 70850 - Posted: 20 Apr 2024, 1:32:31 UTC - in response to Message 70849.  

The error message "The system cannot find the drive specified." is a Windows issue. It has a number of possible causes, such as a disk timeout or a failing drive. It might be worth doing a SMART check on the drive concerned if it keeps happening.
This error accounts for about 10% of CPDN WAH task failures.


Thanks for looking into this, Glenn. I'll run a SMART check on the drive when all the tasks are completed.
ID: 70850
SolarSyonyk

Joined: 7 Sep 16
Posts: 259
Credit: 32,069,627
RAC: 28,154
Message 70851 - Posted: 20 Apr 2024, 2:44:26 UTC - in response to Message 70849.  

The error message "The system cannot find the drive specified." is a Windows issue. It has a number of possible causes, such as a disk timeout or a failing drive. It might be worth doing a SMART check on the drive concerned if it keeps happening.
This error accounts for about 10% of CPDN WAH task failures.


One would think that. But Windows being Windows, there are also apparently some ways it can show up that aren't "drive failure" related, as I've just learned...

I've had two on a machine that, as far as I can tell, has a perfectly good drive (NVMe drive with zero reported problems, virtio block devices through to the Windows 10 VM doing the compute these days).

https://www.cpdn.org/result.php?resultid=22418452
https://www.cpdn.org/result.php?resultid=22418464

I don't know what the codebase looks like, but according to:

https://superuser.com/questions/1807763/inexplicable-the-system-cannot-find-the-drive-specified-how-to-solve-it

and

https://stackoverflow.com/questions/19843849/unexpected-the-system-cannot-find-the-drive-specified-in-batch-file

that error message can occur when something in a batch file is misinterpreted as a drive path when it is meant to be something else.

I'm sure some are the result of a failing drive, but when perfectly good hosts on modern drives are throwing it (without any subsequent errors in other tasks), it seems worth pulling the "weird corner case in a batch file" thread a bit. I'd have assumed it was purely a failing drive error message too, but... apparently not.
ID: 70851
Glenn Carver

Joined: 29 Oct 17
Posts: 814
Credit: 13,663,992
RAC: 8,399
Message 70852 - Posted: 20 Apr 2024, 15:48:31 UTC - in response to Message 70851.  
Last modified: 20 Apr 2024, 15:48:55 UTC

The batch file examples are down to misunderstanding how wildcarding works. We don't use batch files in the Windows apps. Also, if we had an error like that it would fail probably every time.

I guess most people use the default install of BOINC, so it ends up on the C: drive. If that drive gets busy due to other activities, a BOINC process's drive access might time out, particularly because the BOINC processes run at a lower priority. I'm just guessing, but maybe running in a VM makes this error more likely (assuming it's not a hardware issue, of course)?

The other Windows-related error we see (about 15-20% of task fails) is "Invalid control block address". When I looked this up it seemed to be related to Windows Update doing something. I didn't read too far once I knew it wasn't a problem I needed to fix :D. But it's not obvious to me why Windows Update should cause an issue for a running task. Maybe someone who knows Windows better than I do has an idea. I'd be interested to know if it's potentially recoverable.
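
To put numbers on shares like these for your own hosts, one option is to tally saved stderr logs by the two Windows messages mentioned in this thread. This is just a sketch: the function name is made up, the matching is a plain case-insensitive substring search, and the exact wording of the second message as it appears in a log is taken on trust from the description above.

```python
from collections import Counter

# Messages quoted in this thread; anything else is counted as "other".
KNOWN_ERRORS = (
    "The system cannot find the drive specified",
    "Invalid control block address",
)


def tally_failures(stderr_texts):
    """Count failed tasks by which known Windows message their stderr contains."""
    counts = Counter()
    for text in stderr_texts:
        lowered = text.lower()
        for needle in KNOWN_ERRORS:
            if needle.lower() in lowered:
                counts[needle] += 1
                break
        else:
            # No known message found in this log.
            counts["other"] += 1
    return counts
```

Dividing each count by the total number of failed tasks gives a share that can be compared with the rough percentages quoted here.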
---
CPDN Visiting Scientist
ID: 70852
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 70853 - Posted: 20 Apr 2024, 17:26:09 UTC - in response to Message 70852.  

I guess most people use the default install of BOINC, so it ends up on the C: drive. If that drive gets busy due to other activities, a BOINC process's drive access might time out, particularly because the BOINC processes run at a lower priority.


I am not having problems with the two 1015 tasks on my Windows 10 machine. They have now uploaded 10 trickles each.
You will see that it has a small amount of RAM. It has only one drive and it is solid state (the machine was too small to put a spinning hard drive in it).

The machine is set up to run up to 7 BOINC tasks at a time, and it is currently doing that. I only bought the machine to run TaxAct once a year, and I finished that about March 15. Four times a year I run Garmin Express to update the maps in the GPS for my car. So rather than waste the machine the rest of the year, I run BOINC on it. According to my UPS, it costs me $0.93/day for the electricity to run it.

It is another story for my big Linux machine, but I have not gotten any CPDN work for it since last June.

Computer 1512658
Computer information

CPU type 	GenuineIntel
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors 	8

Operating System 	Microsoft Windows 10
Core x64 Edition, (10.00.19045.00)
BOINC version 	7.24.1
Memory 	15.64 GB
Cache 	256 KB

ID: 70853
SolarSyonyk

Joined: 7 Sep 16
Posts: 259
Credit: 32,069,627
RAC: 28,154
Message 70854 - Posted: 20 Apr 2024, 23:21:53 UTC - in response to Message 70852.  

The batch file examples are down to misunderstanding how wildcarding works. We don't use batch files in the Windows apps. Also, if we had an error like that it would fail probably every time.


Unless it's dependent on some particular sequence in the name. Sorry, I don't know enough Windows to reason about it deeply. But my point is mostly that this particular error message can be caused by things that are not a hardware failure on the disk, and it may be worth trying to see if something in the proximity of the failure is doing something dumber-than-desired with disk access strings.


If that drive gets busy due to other activities, a BOINC process's drive access might time out, particularly because the BOINC processes run at a lower priority.


Does Windows do that? It seems a particularly harsh error for a low priority access, especially as it kills the process. "I'll get to you when I get to you" is a lot more standard for low priority tasks under heavy disk IO; they just block on the disk IO until there's a gap to fill them. On Linux, at least, you'll see a very high "iowait" time, but the kernel won't kill a process just because it can't service the disk requests quickly.
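
For the Linux side of that comparison, here is a small sketch of where the iowait number comes from. It takes the aggregate "cpu" line from /proc/stat and works out what fraction of CPU time since boot was spent waiting on IO. Note this is a system-wide figure, not a per-process one, and the function name is made up for illustration.

```python
def iowait_fraction(cpu_line):
    """Fraction of CPU time spent in iowait, from a /proc/stat 'cpu' line.

    Field order after the label: user nice system idle iowait irq softirq
    steal guest guest_nice.
    """
    fields = cpu_line.split()
    if not fields or not fields[0].startswith("cpu"):
        raise ValueError("expected a line starting with 'cpu'")
    values = [int(v) for v in fields[1:]]
    if len(values) < 5:
        raise ValueError("line has no iowait field")
    # Guest time is already included in user/nice, so only the first
    # eight counters are summed to avoid counting it twice.
    total = sum(values[:8])
    return values[4] / total if total else 0.0


if __name__ == "__main__":
    # Linux only: the first line of /proc/stat is the all-CPU aggregate.
    with open("/proc/stat") as f:
        print(iowait_fraction(f.readline()))
```

A high value here means processes are blocked waiting for the disk, which on Linux slows them down rather than terminating them.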

I'm just guessing, but maybe running in a VM might make this error more likely? (assuming it's not a hardware issue of course).


*shrug*

I just used 'winsat disk -drive c' to test the performance of my Win10 VM, which is the only VM on a dedicated compute rig doing nothing else, and it reported 253 MB/s for 16 KB random reads, 2542 MB/s for 64 KB sequential reads, and 1377 MB/s for 64 KB sequential writes. I doubt disk IO is timing out on that box.

I don't know what the failure rates are for Windows in general - it may be a low enough overall failure rate that it's not worth running down.
ID: 70854


©2024 climateprediction.net