Message boards :
Number crunching :
Sorting for platform
Author | Message |
---|---|
Send message Joined: 15 May 09 Posts: 4355 Credit: 16,598,247 RAC: 6,156 |
I read a while ago that with FAMOUS, if one task crashes on a particular OS/CPU combination, the rest of the tasks from that work unit are likely to do likewise; conversely, if one task succeeds with a particular combination, the others are likely to succeed too. Would it be possible to get a greater number of models through by seeing what happens with the first tasks of a batch to go out, and then sending models to where they are most likely to complete? I know there must be other criteria around sending out models which may make this impossible, or too time-intensive (computer or human) to be worthwhile, and I'm sure others have thought of this, but I haven't seen it discussed here.... |
Send message Joined: 16 Jan 10 Posts: 1081 Credit: 7,127,935 RAC: 3,075 |
Unless I'm missing something, it ought to be quite easy to arrange. If you're a Mac/Linux user and try to request a HADAM3P regional model, no model will be supplied because there is no Mac/Linux application (at the moment). So, suppose three related 'applications' were created, each supporting one platform, instead of the current system of one 'application' supporting three platforms. Three identical sets of work units could be created with restrictive 'initial replication' etc. Each platform cohort would work its way through its WUs independently. To avoid the complaint that minor platforms are just reproducing the work already done by major platforms, adjust the WU generation process so that each platform starts from a different place in the master WU list (e.g. Linux from the top down, Mac from the middle up). All the WUs would eventually be covered by each platform. Then add result validation, not as a credit allocation method but as a work allocation strategy to prevent multiple identical completions, and the efficiency of CPDN would be transformed. Or perhaps not. |
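The "different place in the master WU list" idea above can be sketched in a few lines. This is purely illustrative, not CPDN or BOINC code; the function name and the choice of starting points per platform are my own assumptions.

```python
# Illustrative sketch: each platform cohort traverses the same master
# work-unit list, but starts from a different place, so early results
# cover different slices of the batch. Names and orderings are assumed,
# not taken from any real CPDN/BOINC source.

def platform_order(workunits, platform):
    """Return the traversal order of `workunits` for a given platform."""
    n = len(workunits)
    if platform == "linux":        # from the top down
        return list(workunits)
    if platform == "mac":          # from the middle, wrapping around
        mid = n // 2
        return list(workunits[mid:]) + list(workunits[:mid])
    if platform == "windows":      # from the bottom up
        return list(reversed(workunits))
    raise ValueError(f"unknown platform: {platform}")
```

With a validator reporting completions back, the generator could then skip any WU already finished by another cohort.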
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I agree with Iain. If the concept of a 'trusted computer' (one which almost always completes its tasks and generates valid results) could be added, this would eliminate more duplication. A trusted computer would get the only task from a work unit. I don't know whether BOINC has already adopted this concept or whether it's just an idea for future development. Cpdn news |
Send message Joined: 15 May 09 Posts: 4355 Credit: 16,598,247 RAC: 6,156 |
That was, I think, what I was groping towards. I wanted to float it to see if I was missing something obvious. I had also noticed that sometimes four or five tasks from a work unit go out, all to the same platform, and all fail. I also wondered whether tasks that crash on Linux could then be tried on Windows or Mac, and vice versa? |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
For what it's worth, WCG provides the capability for projects to use the "trusted computer" technique; see Single Validation – Type 1: http://www.worldcommunitygrid.org/help/viewTopic.do?shortName=points#174 No indication that BOINC is involved; my guess is that it's IBM/WCG server code. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 5 Aug 04 Posts: 108 Credit: 20,514,432 RAC: 22,931 |
For what it's worth, WCG provides capability for Projects to use the "trusted computer" technique; see Single Validation – Type 1: While both "adaptive replication" and "need_reliable" were developed by WCG, they have been part of the standard BOINC code for a long time. "Adaptive replication" is great for min_quorum = 2 projects, where it can reduce the average from 2.xx tasks per WU to around 1.05 - 1.10, but CPDN uses min_quorum = 1, so adaptive replication can't reduce this any further. As for "need_reliable", this could be an advantage, since you're guaranteed that any re-issue is only sent to "reliable" computers with fast turnaround times, so chances are the re-issue will be returned fairly quickly. It's also possible to set WU priority so high at generation time that the WUs "need_reliable" from the start. But the big problem with FAMOUS is that, apart from some computers that routinely error out all WUs, most FAMOUS errors are WU-specific, so a "reliable" computer will give the same error... Also worth remembering: for a computer to become "reliable" it must have enough validated results, but CPDN has never used a validator... |
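For readers unfamiliar with the mechanism: the "reliable host" behaviour described above is driven by a handful of options in the BOINC server's config.xml. The option names below are from my recollection of the BOINC project-options documentation, and the values are made-up examples, not CPDN's settings.

```xml
<!-- Sketch of the BOINC scheduler options that define "reliable" hosts.
     Values are illustrative only. -->
<config>
  <!-- hosts qualify only if average turnaround is under this (seconds) -->
  <reliable_max_avg_turnaround>75600</reliable_max_avg_turnaround>
  <!-- ...and their recent error rate is below this fraction -->
  <reliable_max_error_rate>0.001</reliable_max_error_rate>
  <!-- WUs at or above this priority are sent only to reliable hosts -->
  <reliable_on_priority>10</reliable_on_priority>
  <!-- re-issues to reliable hosts get a shortened deadline -->
  <reliable_reduced_delay_bound>0.5</reliable_reduced_delay_bound>
</config>
```

Note that the error-rate criterion is exactly why the validator point matters: without validated results, no host can ever satisfy it.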
Send message Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0 |
My Linux box has errored 3 Famous tasks. The fourth has been running for 150 hours and is still running. CPU is AMD Opteron 1210, Linux is SuSE 11.1. Tullio |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Yes, the FAMOUS error rate probably makes the detection of reliable computers impossible. And any computer that's run a lot of slabs will also very probably have had iceworlds. Would failed downloads make a computer unreliable? Cpdn news |
Send message Joined: 5 Aug 04 Posts: 108 Credit: 20,514,432 RAC: 22,931 |
Yes, the FAMOUS error rate probably makes the detection of reliable computers impossible. And any computer that's run a lot of slabs will also very probably have had iceworlds. Well, without a validator no computer will become "reliable"... :) But as far as download errors are concerned, they decrease the daily quota, and any computer with a decreased daily quota is not "reliable". Since the quota increases again on "success" reports, a computer with no other reason for being unreliable can very quickly be back to "reliable" again. The problem with FAMOUS is that if, for example, the first copy is sent to an "Intel + Windows" machine and this gives an error, there's a fairly good chance that the "reliable" computer getting the re-issue will also be "Intel + Windows", and in most instances that means the exact same error. So being "reliable" doesn't really mean much for FAMOUS, since it's the WUs themselves that are unstable, not the majority of computers. |
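The quota behaviour described in this thread (reduced by 1 per failure, doubled per reported success, as the posts above and below describe it) can be sketched as follows. The function name and the maximum value are assumptions for illustration, not actual BOINC server code.

```python
# Sketch of the "old-style" daily quota behaviour as described in this
# thread: each failure (e.g. a failed download) knocks the host's quota
# down by 1, each reported success doubles it back up to the project
# maximum. Constant and names are illustrative, not real BOINC code.

DAILY_RESULT_QUOTA = 16  # assumed project-wide per-host maximum

def update_quota(quota, reported_success):
    """Return the host's new daily quota after one reported result."""
    if reported_success:
        return min(DAILY_RESULT_QUOTA, quota * 2)  # fast recovery
    return max(1, quota - 1)  # never drops below 1
```

The doubling is why a host that starts succeeding again recovers its full quota within a few reports, as noted above.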
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
So. A non-BOINC script that scans the computer list based on project defined criteria. Backups: Here |
Send message Joined: 15 May 09 Posts: 4355 Credit: 16,598,247 RAC: 6,156 |
It's worth checking what happened to the other tasks in those work units. From posts on this subject in the past, if one task in a work unit crashes with a particular CPU/OS combination, the others will too. The spinup models are also more likely than others to crash, but those that do work are used to generate more models; the spinup models all start with a year somewhere around 499. I have also had quite a few models crash, but on looking here I can see it is not the fault of the computer: looking at the work units, the other tasks have also failed to complete, on both Windows and Linux. For some, the model itself is unstable and ends up with a negative value for air pressure. There are other impossible values that also cause a crash, at which point the most useful thing your computer can do is report the problem and download another work unit. |
Send message Joined: 3 Oct 06 Posts: 43 Credit: 8,017,057 RAC: 0 |
But as far as download errors are concerned, they decrease the daily quota, and any computer with a decreased daily quota is not "Reliable". Since the quota increases again on "success" reports, a computer with no other reason for being unreliable can very quickly be back to "Reliable" again. There are no actual "success" tasks, right? Only completed ones. By that measure, a daily quota could never recover. Does that mean a 'failure' on CPDN does not actually decrease the quota for that host? Or is the original quota restored? Maybe that's the problem with minussed hosts that do not stay minussed? |
Send message Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
I think the problem with the minussed computers that do not remain minussed is something completely separate. This minussing/unminussing problem isn't linked to the usual quota mechanism. There must be some defect in CPDN's BOINC server version that nobody has been able to identify. (If I am honest, I have to say that this is not the only defect.) Even if CPDN had a validator that could identify reliable computers, as well as the HadSM and FAMOUS models that cannot complete on some types of computer because of inherent but usually unpredictable problems with certain parameter values, there's also the current problem of failed downloads, because the server appears unable to cope with the load all the time. I counted the successful and failed downloads for all the members who joined CPDN on one day a few days ago: 103 models downloaded successfully, 23 failed. In most cases this isn't the fault of the computer, but its daily quota is still reduced by 1 for each failed download. Cpdn news |
Send message Joined: 5 Aug 04 Posts: 108 Credit: 20,514,432 RAC: 22,931 |
There are no actual "success" tasks, right? Only completed ones. By that measure, a daily quota could never recover. Does that mean a 'failure' on CPDN does not actually decrease the quota for that host? Or is the original quota restored? I didn't mean "success" tasks, but "reported as success": at least with the "old-style" quota code, every time a "success" was reported by the client, the quota was doubled (if it wasn't already at the maximum). So a "success" report means "the client reports the task finished without any errors". Whether it later becomes invalid or something due to the validator is another matter... BTW, the web pages have apparently been changed, so task status is no longer shown as "success" but as "Completed, waiting for validation" or another variant of "Completed...". As for how the "new-style" per-application quota system works, I haven't looked up the new code, but at least going by another project, "pending" tasks don't change the quota, only validated tasks do... But that obviously can't be the case for the server code CPDN is using, since if it were, most CPDN computers would by now be sitting with a quota of 1 per application. |
©2024 climateprediction.net