Posts by Richard Haselgrove


1) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70847)
Posted 6 days ago by Richard Haselgrove
Post:
Task 22425060, test batch 1014, Intel has finished successfully.
2) Message boards : Number crunching : Batch 1015 Discussion/problems (Message 70830)
Posted 9 days ago by Richard Haselgrove
Post:
Slightly off topic, but related to current issues.

I've been sent a batch 1007 resend:

wah2_eas25_a1cu_199312_24_1007_012266614_1

The previous user had an Intel i9, but only managed three trickles in a fortnight. My i5 will probably run through it in 8 - 9 days, but is it worth it?
3) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70814)
Posted 13 days ago by Richard Haselgrove
Post:
Another oddity is that the two most recent trickles - the ones you're referring to - both refer to the same timestep.

It might be helpful if you could look in BOINC's "Event Log" around the times the two last trickles were recorded. Each trickle should show up in two ways:

12/04/2024 05:25:35 | climateprediction.net | Sending scheduler request: To send trickle-up message.
12/04/2024 05:25:53 | climateprediction.net | Started upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip
12/04/2024 05:30:43 | climateprediction.net | Finished upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip (99389892 bytes)
- a scheduler request, and an upload file. How many do you see around that time?

Glenn may need to work out whether the duplication was caused by the application, or by the post-processing on the server. (One possibility is that the task was interrupted close to the 'trickle point', and restarted from an earlier checkpoint.)
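
For anyone wanting to do that count without reading the log by eye, here is a minimal Python sketch. It assumes the Event Log has been saved as plain text in the same "timestamp | project | message" layout as the three lines quoted above; the function name and the dictionary keys are made up for illustration and are not part of BOINC.

```python
import re

# Sample lines copied from the Event Log extract quoted in the post.
SAMPLE_LOG = """\
12/04/2024 05:25:35 | climateprediction.net | Sending scheduler request: To send trickle-up message.
12/04/2024 05:25:53 | climateprediction.net | Started upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip
12/04/2024 05:30:43 | climateprediction.net | Finished upload of wah2_eas25_a00l_201312_24_1014_012276657_0_r1583651690_2.zip (99389892 bytes)
"""

# Assumed layout: "dd/mm/yyyy hh:mm:ss | project | message".
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| ([^|]+?) \| (.*)$"
)


def count_trickle_events(log_text, project="climateprediction.net"):
    """Count trickle-related scheduler requests and uploads for one project."""
    counts = {"scheduler_requests": 0, "uploads_started": 0, "uploads_finished": 0}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group(2) != project:
            continue  # skip other projects and lines in a different layout
        message = match.group(3)
        if "To send trickle-up message" in message:
            counts["scheduler_requests"] += 1
        elif message.startswith("Started upload of"):
            counts["uploads_started"] += 1
        elif message.startswith("Finished upload of"):
            counts["uploads_finished"] += 1
    return counts


if __name__ == "__main__":
    print(count_trickle_events(SAMPLE_LOG))
```

A single trickle should give one of each. Two scheduler requests against one upload (or the reverse) around the same time would be worth reporting.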
4) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70812)
Posted 14 days ago by Richard Haselgrove
Post:
My batch 1014 test task has passed the first trickle point and is continuing to crunch. Good news.
5) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70807)
Posted 14 days ago by Richard Haselgrove
Post:
As announced in the 2024 new work thread, there's another test batch today.

I've got wah2_eas25_a00l_201312_24_1014_012276657_0, so I'll amend the thread title accordingly. It's running a little slowly, because of changes I've made locally to accommodate another project - I can amend those to speed this task up, without risking a restart.
6) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70757)
Posted 19 days ago by Richard Haselgrove
Post:
Hmmm. I wouldn't be totally sure about that one. A Windows app which fails to start because of a missing DLL usually bombs out with:

- exit code -1073741515 (0xc0000135)
and the generic description is "The application failed to initialize properly".

BOINC (and hence the BOINC library which is linked into the app or the wrapper) has its own set of error codes, which you can find at:

https://github.com/BOINC/boinc/blob/master/lib/error_numbers.h

They include both positive and negative values, so I'd suspect both of these, in addition to the MS Windows numbers:

#define EXIT_SIGNAL 193 // app was killed by signal
#define ERR_INVALID_EVENT -193

That doesn't get us much further forward, but I'm still in brainstorming mode.
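
The negative decimal and the hex form in the exit code above are the same 32-bit value read as signed and unsigned. A small Python sketch of the conversion follows, in case it helps when comparing reported codes against the Windows or BOINC lists. The helper names are hypothetical, chosen for illustration.

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Show a (possibly negative) 32-bit exit code as unsigned hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


if __name__ == "__main__":
    # The missing-DLL exit code quoted in the post.
    print(to_unsigned_hex(-1073741515))
    print(to_signed32(0xC0000135))
    # The value passed to boinc_finish() in the failing tasks.
    print(to_unsigned_hex(193), to_unsigned_hex(-193))
```

Small positive codes like 193 stay small in hex, so they are easy to tell apart from the 0xc000.... style of Windows status values.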
7) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70753)
Posted 19 days ago by Richard Haselgrove
Post:
Looks like both of these errored out as well.
Yes, and at exactly the same place.

The stdout_mon.txt file for 22424396 ends with:

...
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011616 A - 02/01/2010 00:00 - H:M:S=0007:43:16 AVG= 2.39 DLT= 1.15
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011617 P - 01/01/2010 00:05 - H:M:S=0007:43:21 AVG= 2.39 DLT= 5.10
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011618 P - 01/01/2010 00:10 - H:M:S=0007:43:29 AVG= 2.39 DLT= 8.17
Model crash detected, will try to restart...
The other machine is in a different room, and I'll check it later.
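
To compare the crash point across several machines, the progress lines can be pulled apart with a regular expression. This is only a sketch based on the three lines quoted above: the pattern is inferred from that extract, the field names (PH, TS, AVG, DLT) are kept as they appear without interpreting them, and the function names are made up for illustration.

```python
import re

# Tail of stdout_mon.txt as quoted in the post.
SAMPLE = """\
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011616 A - 02/01/2010 00:00 - H:M:S=0007:43:16 AVG= 2.39 DLT= 1.15
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011617 P - 01/01/2010 00:05 - H:M:S=0007:43:21 AVG= 2.39 DLT= 5.10
wah2_eas25_n01t_200912_24_1009_012276361 - PH 1 TS 0011618 P - 01/01/2010 00:10 - H:M:S=0007:43:29 AVG= 2.39 DLT= 8.17
Model crash detected, will try to restart...
"""

# Pattern inferred from the sample lines only; other app versions may differ.
PROGRESS_RE = re.compile(
    r"^(?P<task>\S+)\s+-\s+PH\s+(?P<ph>\d+)\s+TS\s+(?P<ts>\d+)\s+(?P<flag>\w)\s+-\s+"
    r"(?P<model_date>\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2})\s+-\s+"
    r"H:M:S=(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)\s+"
    r"AVG=\s*(?P<avg>[\d.]+)\s+DLT=\s*(?P<dlt>[\d.]+)\s*$"
)


def parse_progress_line(line):
    """Return a dict for a progress line, or None for anything else."""
    match = PROGRESS_RE.match(line.strip())
    if not match:
        return None
    elapsed = int(match["h"]) * 3600 + int(match["m"]) * 60 + int(match["s"])
    return {
        "task": match["task"],
        "ts": int(match["ts"]),
        "flag": match["flag"],
        "model_date": match["model_date"],
        "elapsed_seconds": elapsed,
        "dlt": float(match["dlt"]),
    }


def last_timestep(text):
    """Return the TS value of the last progress line, or None if there is none."""
    parsed = [p for p in map(parse_progress_line, text.splitlines()) if p]
    return parsed[-1]["ts"] if parsed else None


if __name__ == "__main__":
    print(last_timestep(SAMPLE))
```

Running `last_timestep` over the file from each failed task would show quickly whether they all stop at the same TS value.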
8) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70749)
Posted 20 days ago by Richard Haselgrove
Post:
Starting to get some tasks from batch 1009 - I assume these are the test run.

So far, got tasks 22424380 and 22424396.

Not seeing them on the server status page yet, but that doesn't update in real time.
9) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70747)
Posted 20 days ago by Richard Haselgrove
Post:
Four of my six machines have now got resends from the 2nd April batch, but there's no sign of the test batch yet. I'll keep these out of circulation for the time being, until and unless Glenn can give us a more precise ETA.

The trouble is that if our clients get consistent "no tasks" replies from the server, they stop asking (or at least, they ask less frequently). BOINC doesn't really take the needs of this type of test into account.
10) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70745)
Posted 20 days ago by Richard Haselgrove
Post:
I'm restarting work fetch on my 6 Windows machines, on a 10-minute stagger and with a limit of one per machine - that should maximise my chances of being one of the 'select 100'.

Edit - and the next one in line got a task. Unfortunately, like Dave's, it's a resend from the previous (failing) run.

Glenn, should I keep it, or send it straight back? It's the third copy, so should kill the workunit if I abort it.

Workunit 12273481
11) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70733)
Posted 21 days ago by Richard Haselgrove
Post:
Yes, phenom2, all Ryzen and thread ripper CPUs support SSE4.2
I think I had a vague memory of SSE4a, but that'll be ancient history for current processors.
12) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70723)
Posted 22 days ago by Richard Haselgrove
Post:
Brainstorming it through with myself, could it be a compiler switch gone rogue? Might you inadvertently be compiling it with optimisations that work on AMD chips only, inserting opcodes that are valid for AMD but aren't available on Intel?
13) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70718)
Posted 22 days ago by Richard Haselgrove
Post:
I second that opinion. Just seen my final two fail - that's 12 out of 12, all on Intel - and it includes the machine that processed the dev site task that failed for Dave.
14) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70708)
Posted 22 days ago by Richard Haselgrove
Post:
My Windows 11 laptop has now crashed its two tasks as well, at the same place. I've held back the out.zip file for the time being, in case it's any use, but it sounds like the offline debug run will be a better bet.

Edit - in view of the reply, I'll let them go. Holding back on even requesting new tasks until we get the go-ahead.
15) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70704)
Posted 22 days ago by Richard Haselgrove
Post:
A sample from one of mine:

Task 22416552

std_err:
executeModelProcess: MonID=4880, GCM_PID=3548, RCM_PID=4548
04:27:51 (4548): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = Controller4548
stdout_mon.txt:
...
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011616 A - 02/01/2011 00:00 - H:M:S=0008:25:01 AVG= 2.61 DLT= 1.22
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011617 P - 01/01/2011 00:05 - H:M:S=0008:25:07 AVG= 2.61 DLT= 5.41
wah2_eas25_n0ko_201012_24_1008_012272000 - PH 1 TS 0011618 P - 01/01/2011 00:10 - H:M:S=0008:25:16 AVG= 2.61 DLT= 9.38
Model crash detected, will try to restart...
Slight garble in std_err, but seems to be the same thing.

My machines are all Intel hardware, mostly running Windows 7 Professional x64 without any emulation layer. My Windows 11 laptop also got two tasks, which are still running - I'll keep an eye on them.
16) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70700)
Posted 22 days ago by Richard Haselgrove
Post:
I started 8 tasks yesterday, starting soon after release (2 tasks each on four machines, all Intel i5). Sample task:

wah2_eas25_n2nl_201712_24_1008_012274697_0

The only clue I can see so far is:

Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 7432, iMonCtr = 2
Model crash detected, will try to restart...
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 2136, selfPID = 4880, iMonCtr = 2
All failed after about 8 hours, round about where the first trickle would have been expected.

I'll have another look round in more detail later.

Edit - looks like they didn't send either a credit trickle or a data trickle. But this machine did send an out.zip, which may help.
17) Message boards : Number crunching : Should full credit be given for time on non successful tasks? (Message 70689)
Posted 24 days ago by Richard Haselgrove
Post:
I looked around a bit and several projects seem to have had people with this issue. If for any reason the benchmark for your machine is optimistic it can cause this error. Manually rerunning benchmarks should solve it for future tasks but not those already downloaded.
I've been toying with asking BOINC if they need to re-validate the benchmarking code for these massive CPUs. The benchmark is supposed to saturate all available cores - 32 threads, in this case - and use the average for a single-thread task, but my head hurts when I try to read the code.
18) Message boards : Number crunching : Should full credit be given for time on non successful tasks? (Message 70688)
Posted 24 days ago by Richard Haselgrove
Post:
Sure - well, I'll give it a try, anyway.

You're on the right lines with your maths. The error message is actually badly worded. BOINC doesn't really deal with time (because it's meant to cope with computers with widely differing speeds), so the key figure is the estimated 'size' of any given task. That's expressed in terms of the number of floating point arithmetic calculations the task will take to complete, as <rsc_fpops_est> in the description of each task. That figure is set by the project team for each task type: it's the one thing they have total control of. I think this project gets that one pretty much correct - it was 3,801,388 billions of operations for the last WaH2 batch I looked at.

From that basic figure, the BOINC server calculates a <rsc_fpops_bound> - by default, ten times larger than the estimate. Some projects in the past have got the estimate badly wrong, and pushed up the bound to 100x or even 1000x to escape from their own error, but I'd urge against that.

The other factor in the time limit is the speed of the computer. This is where host 1548623 went wrong.

That computer was only attached to the project on 16 Jan 2024, and we haven't had very much work since then. So the only information BOINC has available is the machine's self-reported benchmark. That's currently reported as 47.15 billion ops/sec, but it may have been slightly different when the task you're looking at was allocated on 29 Feb 2024. That could be a random fluctuation - not significant.

But look a bit further down the host page, at the Application details for that computer. It has processed tasks for application versions 8.24 and 8.29, at 4.93 GFLOPS and 5.26 GFLOPS respectively. Those are much more realistic values, but BOINC will be ignoring them completely.

The figures are real, actual, values, calculated from tasks running on that individual computer and reaching a successful conclusion. But BOINC doesn't trust them until it has a minimum of 11 completed, valid, tasks. I suspect this machine will vanish from the project long before it reaches that target - it only has one qualifying task so far for v8.29.
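
The arithmetic can be sketched in a few lines of Python. This is a simplified model, not BOINC's actual code: it assumes the time limit is just the bound divided by whichever speed figure BOINC is using, with the bound at the default ten times the estimate. The function names are made up for illustration; the numbers are the ones quoted in this post.

```python
SECONDS_PER_DAY = 86_400

# Figures quoted in the post.
FPOPS_EST = 3_801_388 * 1e9   # <rsc_fpops_est>: 3,801,388 billion operations
BOUND_FACTOR = 10             # default ratio of <rsc_fpops_bound> to the estimate
BENCHMARK_FLOPS = 47.15e9     # self-reported benchmark for host 1548623
MEASURED_FLOPS = 5.26e9       # rate from the host's completed v8.29 task


def time_limit_days(fpops_est, flops, bound_factor=BOUND_FACTOR):
    """Assumed limit: (estimate * bound factor) / speed, expressed in days."""
    return fpops_est * bound_factor / flops / SECONDS_PER_DAY


def expected_runtime_days(fpops_est, flops):
    """Run time implied by the estimate at a given speed, in days."""
    return fpops_est / flops / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"limit using benchmark:      {time_limit_days(FPOPS_EST, BENCHMARK_FLOPS):.1f} days")
    print(f"limit using measured rate:  {time_limit_days(FPOPS_EST, MEASURED_FLOPS):.1f} days")
    print(f"run time at measured rate:  {expected_runtime_days(FPOPS_EST, MEASURED_FLOPS):.1f} days")
```

Under those assumptions the optimistic benchmark gives a limit of a little over 9 days, while the measured rate implies a run time of a little over 8 days. That leaves very little headroom, so any slowdown tips the task over the limit. The same sum done with the measured rate would give a limit of well over 80 days.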

We can only guess at the motivation of the user who attached the machine to the project in the first place. The processor was introduced in late 2019, so it could be up to 4 years old: maybe it's a rebuild or rescued machine, and he wanted to test it out? If so, it wasn't a very well designed test, and scientifically idiotic.
19) Message boards : Number crunching : Should full credit be given for time on non successful tasks? (Message 70683)
Posted 29 days ago by Richard Haselgrove
Post:
I think CPDN is a particularly difficult case for BOINC. Although it seems straightforward from the outside, this is the sort of project that doesn't respond well to the "Fish some second-hand heavy metal out of a skip, power it up, turn all the knobs up to 11, and walk away" approach to crunching.

I've mentioned host 1549227 on the board before: this week, host 1548623 came to my attention as well. They both fall into the 'Heavy Metal' category, with 16 thread and 32 thread capacity respectively. But they're a bit skinny on memory, with just 1GB per thread, and the hyperthreading will hit floating point speed something rotten. You can't just walk away from a machine like that, and think "job done".

Part of the problem is that BOINC Central - the main developers - seem to have adopted an approach, over the last 10 years or so, that human volunteers just get in the way: the computers know how it all works, right out of the box, and can be left to work it out for themselves. The most egregious example of this, of course, is Science United.

But BOINC isn't perfect, and doesn't cope perfectly under all conditions. Any programmer will know that instinctively, even if they don't want to talk about it in polite company. Glenn drew my attention to 1548623, and asked if I could throw some light on the tasks which had been aborted with 'EXIT_TIME_LIMIT_EXCEEDED'. I ran the numbers yesterday, and the problem turns out to be that BOINC has benchmarked the CPU at 47.15 GFlops. That's something that CPDN can't cure, and I strongly suspect that BOINC won't cure either (and because the machines are anonymously registered, we - as users - have no way of making contact and offering to help).

No, I think that the original idea of BOINC - to get volunteers interested and involved in the science, as well as enjoying the competition and the social sides - was on the right lines. But there's a lot of noise out there competing for our attention.
20) Message boards : Number crunching : Top participants RAC (Message 70665)
Posted 23 Mar 2024 by Richard Haselgrove
Post:
And his computers are hidden, too - so we have no idea what operating system he's running, and hence what task types he's likely to have processed.

But you're right - there was a credit glitch last summer. The earliest IFS runs for Linux weren't fully credited in real time, under the old system. When the new system was activated for the first time, everyone's RAC was calculated as if the work had all been processed on a single recent day. That made for a huge spike in RAC. He may have decided to retire from this project and rest on his laurels at that point ...



©2024 climateprediction.net