New work discussion

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4504 Credit: 18,450,004 RAC: 1,042	Message 69153 - Posted: 7 Jul 2023, 5:24:16 UTC Fresh batch of 120month spin up tasks has been released on testing. We are looking at 25 days to completion for the three on my Ryzen so looking at about a month before the new batch of main site tasks if all goes well with them. ID: 69153 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433	Message 69155 - Posted: 7 Jul 2023, 8:32:07 UTC - in response to Message 69153. Last modified: 7 Jul 2023, 8:38:21 UTC Fresh batch of 120month spin up tasks has been released on testing. We are looking at 25 days to completion for the three on my Ryzen so looking at about a month before the new batch of main site tasks if all goes well with them. That is a batch of test spinup workunits from the current failing batch that have already failed on my machine. I'm not sure why you picked up the resends. CPDN are going to discuss with the scientists about redefining the region for the limited area model. We think the size of it is causing the segv. I'm still suspicious that your WINE implementation avoids the segv by sandboxing the environment, which is why your tasks are running. I'll check with Sarah as I think you can abort those. I'll PM you. --- CPDN Visiting Scientist ID: 69155 ·

Helmer Bryd Send message Joined: 16 Aug 04 Posts: 156 Credit: 9,035,872 RAC: 2,928	Message 69159 - Posted: 7 Jul 2023, 13:03:25 UTC Hi! Where are those Big OIFS work units that were talked about earlier? I upgraded my systems but haven't seen any. ID: 69159 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433	Message 69160 - Posted: 7 Jul 2023, 13:08:51 UTC - in response to Message 69159. Hi! Where are those Big OIFS work units that were talked about earlier? I upgraded my systems but haven't seen any. You must have read my mind, I was going to post an update today! It's taken longer than expected for U.Oxford to sort out my visiting scientist post, which will now start at the beginning of Oct. That will give me login access to their systems so I can properly run tests etc. CPDN also had to unexpectedly move off their existing servers and install new ones, which added another delay. So come the autumn, we'll be getting back to OpenIFS testing with high resolution, multicore batches. The scientist who ran the Baroclinic Lifecycle experiments is also keen to do some high resolution work in the autumn as well. Currently I'm looking at a problem we found where some model output files get lost when they are returned. --- CPDN Visiting Scientist ID: 69160 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4504 Credit: 18,450,004 RAC: 1,042	Message 69161 - Posted: 7 Jul 2023, 14:13:14 UTC - in response to Message 69155. I'm still suspicious that your WINE implementation avoids the segv by sandboxing the environment, which is why your tasks are running. I'll check with Sarah as I think you can abort those. I'll PM you. Thanks Glenn. Now aborted. So if fresh spin-ups are required when the changes are made, it will probably be a bit longer until the fixed batch arrives. ID: 69161 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69164 - Posted: 7 Jul 2023, 18:11:13 UTC I wish someone would tell me if I'm to keep running these WAH Windows tasks. Not interested in credits, interested in if they will help the scientists. ID: 69164 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4504 Credit: 18,450,004 RAC: 1,042	Message 69182 - Posted: 9 Jul 2023, 7:58:39 UTC - in response to Message 69164. I wish someone would tell me if I'm to keep running these WAH Windows tasks. Not interested in credits, interested in if they will help the scientists. Sorry for the delay. Yes the results will be used by the scientists as part of designing the next batch which hopefully will be less prone to failures. ID: 69182 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69184 - Posted: 9 Jul 2023, 11:23:06 UTC - in response to Message 69182. Last modified: 9 Jul 2023, 11:27:02 UTC Yes the results will be used by the scientists as part of designing the next batch which hopefully will be less prone to failures. I wish they'd fix the computation error on Windows restart problem, I've just busted all the 24 tasks I have. I did as instructed, I suspended them, waited 2 minutes, closed Boinc, waited 2 minutes, then rebooted. Upon continuing them, every single one crashed. What on earth is it doing to cause this problem? There must be an easy fix. It's wasting a lot of computation cycles, burning fossil fuels for a project which is supposed to be against it. It's the most recent ones here: https://www.cpdn.org/results.php?hostid=1509739&offset=0&show_names=0&state=6&appid= ID: 69184 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4504 Credit: 18,450,004 RAC: 1,042	Message 69185 - Posted: 9 Jul 2023, 11:41:38 UTC - in response to Message 69184. There must be an easy fix. I think if the fix was that easy, it would have been sorted long ago. My personal preference would be to go over completely to the OIFS tasks which don't have the problem. They were written from the ground up, though. I have been crunching for CPDN since 2009 with this ID and before that on an ID I lost when my ISP got taken over. The issue has been around on both Linux and Windows tasks for as long as I have been with the project. Before the days of multiple cores, I would take backups that could be restored before a reboot, but the procedure for that is much more complicated with multiple tasks running, even if only running one project. ID: 69185 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433	Message 69186 - Posted: 9 Jul 2023, 14:18:02 UTC - in response to Message 69184. I wish they'd fix the computation error on Windows restart problem, I've just busted all the 24 tasks I have. I did as instructed, I suspended them, waited 2 minutes, closed Boinc, waited 2 minutes, then rebooted. Upon continuing them, every single one crashed. What on earth is it doing to cause this problem? There must be an easy fix. I wish there was a fix too just like I wish there was less moaning on these forums, but I suspect one will be easier to fix than the other. They fail on restarts and suspend/resume on my machine too and CPDN are aware of this. As you know CPDN has scant resources mostly spent 'firefighting' this year rather than focussing on issues like this. It also doesn't seem to be a problem for all machines for some reason. The model throws away the useful logs before it returns them to the server, which means someone needs to set up a local test, add code to print more information, to pin down what's causing the problem. So it's several weeks of work at least. Although it's something I could do I prefer to finish working through the current issues with OpenIFS first and Andy has his hands full with more pressing issues. --- CPDN Visiting Scientist ID: 69186 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433	Message 69187 - Posted: 9 Jul 2023, 14:31:45 UTC - in response to Message 69185. I think if the fix was that easy, it would have been sorted long ago. My personal preference would be to go over completely to the OIFS tasks which don't have the problem. That won't happen because WaH and OIFS have different capabilities. WaH is a nested model: a global model that drives a much higher resolution regional model. OpenIFS is only a global model. Although it could run at the resolution the regional model is using, it could only do that globally, and the memory requirements for that would need a compute cluster. OpenIFS tasks have the same problem but not as often. If the restart files (or 'checkpoint' in boinc lingo) are corrupted or not written properly, OpenIFS will fail to restart. We saw this happen with the last batches that went out. From the little I know of WaH, it handles its restart files differently to OIFS, which closes them after each write. WaH I believe keeps the files open, which could mean they are not flushed to disk properly on shutdown. That would be where I'd start looking. --- CPDN Visiting Scientist ID: 69187 ·
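Editor's sketch of the point above about restart files that are kept open and may not be flushed on shutdown. This is not WaH or OpenIFS code (neither is shown in this thread); the function name write_checkpoint and the file names are hypothetical. It only illustrates one common way to make a checkpoint write durable: write to a temporary file, flush and fsync it, close it, then rename it over the old checkpoint so that a shutdown leaves either the complete old file or the complete new one.

```python
import os
import tempfile


def write_checkpoint(path: str, data: bytes) -> None:
    """Write `data` to `path` via a temporary file in the same directory.

    Hypothetical helper for illustration only. The temporary file is
    flushed, fsync'd and closed before it replaces the old checkpoint,
    so an interrupted write never leaves a half-written `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push it to disk
        # The file is closed here; swap it in place of the old checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A program that instead keeps its restart file open for the whole run relies on buffers being flushed when it is told to stop, which is the behaviour suspected above.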

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 1051 Credit: 36,341,855 RAC: 2,973	Message 69188 - Posted: 9 Jul 2023, 15:18:23 UTC - in response to Message 69187. Last modified: 9 Jul 2023, 15:18:54 UTC OpenIFS tasks have the same problem but not as often. If the restart files (or 'checkpoint' in boinc lingo) are corrupted or not written properly, OpenIFS will fail to restart. We saw this happen with the last batches that went out. From the little I know of WaH, it handles its restart files differently to OIFS, which closes them after each write. WaH I believe keeps the files open, which could mean they are not flushed to disk properly on shutdown. That would be where I'd start looking. Couldn't we help with that? It might be a write error on closedown, or a read error on restart. If you have a task to run, watch to see what happens on checkpoint (set <checkpoint_debug> in the Event Log). If the program closes the files, the timestamp will change - if no files have a new timestamp, they were being kept open. If the timestamp changes, wait until just after a checkpoint, and copy all the new files to somewhere outside BOINC's control while the program is still running. [If you can't copy them, then I'm wrong about when the timestamp changes, and they're being kept open] If you catch a set, keep them safe and post their vital statistics here: names, original location, bytecount, perhaps even first and last few lines if they can be rendered in human-readable format. Then, shut down BOINC, and restart it - report whether yours is a 'crasher' or a 'runner'. [another reason for only testing one task at a time - and perhaps preferably early in the run] But if you can't help, please don't waste time by posting the same complaint over and over again. They know! ID: 69188 ·
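Editor's sketch of the timestamp check and copy described above, for anyone who would rather script it than watch the folder by hand. Everything here is an assumption for illustration: the function names are made up, and you would need to point it at the task's own working folder inside your BOINC data directory, which is not given in this thread. It takes a snapshot of file timestamps and sizes, compares two snapshots to find what changed across a checkpoint, and copies the changed files somewhere outside BOINC's control, reporting names and byte counts.

```python
import os
import shutil


def snapshot(directory):
    """Map each file's relative path to (modification time in ns, size in bytes)."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file vanished between listing and stat
            state[os.path.relpath(full, directory)] = (st.st_mtime_ns, st.st_size)
    return state


def changed_files(before, after):
    """Return files that are new, or whose timestamp or size differs."""
    return sorted(p for p, meta in after.items() if before.get(p) != meta)


def copy_out(directory, names, dest):
    """Copy the named files to `dest`; return (name, byte count) for each success."""
    copied = []
    for rel in names:
        target = os.path.join(dest, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        try:
            shutil.copy2(os.path.join(directory, rel), target)
        except OSError as err:
            # A failed copy is itself worth reporting: the file may be held open.
            print(f"could not copy {rel}: {err}")
            continue
        copied.append((rel, os.path.getsize(target)))
    return copied
```

Usage would be along these lines: call snapshot() on the task folder, wait until the Event Log reports a checkpoint, call snapshot() again, pass both to changed_files(), then hand that list to copy_out() with a destination outside the BOINC data directory. If changed_files() comes back empty after a checkpoint, that would be consistent with the files being kept open, as suggested above.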

Alan K Send message Joined: 22 Feb 06 Posts: 489 Credit: 30,625,891 RAC: 3,476	Message 69191 - Posted: 9 Jul 2023, 22:18:07 UTC - in response to Message 69112. Still having problems with the out file though the 25th zip has gone. Out stuck at 59% and it is going to upload7. ID: 69191 ·

rob Send message Joined: 5 Jun 09 Posts: 96 Credit: 3,614,983 RAC: 2,400	Message 69230 - Posted: 11 Jul 2023, 13:35:29 UTC After a run of wah2 ea25 tasks all of which failed in the first couple of minutes I've now got wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion. ID: 69230 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69235 - Posted: 11 Jul 2023, 13:51:39 UTC - in response to Message 69230. Last modified: 11 Jul 2023, 13:51:57 UTC After a run of wah2 ea25 tasks all of which failed in the first couple of minutes I've now got wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion. Just don't restart the computer. I hope you have a UPS. I've lost all 30 non-non-starters to that problem. ID: 69235 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433	Message 69238 - Posted: 11 Jul 2023, 15:34:31 UTC - in response to Message 69230. After a run of wah2 ea25 tasks all of which failed in the first couple of minutes I've now got wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion. The NZ25 tasks use a different configuration; a much smaller grid for the regional model so don't suffer the same problem as the EAS25 config. Smaller domains for the EAS25 batch are currently being tested. --- CPDN Visiting Scientist ID: 69238 ·

rob Send message Joined: 5 Jun 09 Posts: 96 Credit: 3,614,983 RAC: 2,400	Message 69240 - Posted: 11 Jul 2023, 16:28:22 UTC - in response to Message 69238. Thanks Glenn, I had a feeling there was a difference between the EAS & NZ data sets, but wasn't certain. It will be interesting to see how the "new, smaller" EAS data sets behave - hopefully better than the "old, large" ones that have been so much trouble for so many (I'm so glad mine all failed in a couple of minutes, not several days). ID: 69240 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1118 Credit: 17,163,134 RAC: 2,081	Message 69241 - Posted: 11 Jul 2023, 17:27:47 UTC - in response to Message 69230. After a run of wah2 ea25 tasks all of which failed in the first couple of minutes I've now got wah2 nz25 task which has run for 27 minutes and counting. Fingers and toes crossed for the next few days running and a successful conclusion. After a run of a couple of weeks ago, all tasks failing, my pipsqueak Windows10 box just got a new CPDN task, and instead of failing after about 4 minutes, it has now run for almost 25 minutes, with a little over 10 days predicted to go. ID: 69241 ·

rob Send message Joined: 5 Jun 09 Posts: 96 Credit: 3,614,983 RAC: 2,400	Message 69242 - Posted: 11 Jul 2023, 17:44:31 UTC - in response to Message 69235. I will be stopping and restarting the computer, so I'll keep you posted. It will be interesting as there are a couple of differences, one is I'm only allowing CPDN to run a single task just now, and also these tasks are from a different data set to those that were failing in minutes. p.s. Still running after 4.5 hours, but fingers and toes still crossed for the next 9 days of run time which is more like a couple of weeks clock time. ID: 69242 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433	Message 69243 - Posted: 11 Jul 2023, 18:19:48 UTC - in response to Message 69242. I will be stopping and restarting the computer, so I'll keep you posted. It will be interesting as there are a couple of differences, one is I'm only allowing CPDN to run a single task just now, and also these tasks are from a different data set to those that were failing in minutes. p.s. Still running after 4.5 hours, but fingers and toes still crossed for the next 9 days of run time which is more like a couple of weeks clock time. I've had trouble with WAH2 NZ tasks before. Failed after restarting from a sleep. Your experience may vary. It's a known issue the restart is problematic. To avoid a task fail, I alter the computing options at night to 20% cpu instead of 100% and only have 1 task running at a time. Also change the power options in Windows to 'energy efficient' from performance. Bit tedious but it does get the power usage of the PC down to hopefully keep the batteries going a little longer. ID: 69243 ·
