Redundancy

DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 33382 - Posted: 17 Apr 2008, 3:20:34 UTC

Take a look at this workunit I just finished:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6150217

There are three finished tasks and two pending. One was listed as "didn't need". I understand that the BOINC server is capable of sending a "kill trickle"; I recall this feature being used with an erroneous software version last year. My question: is this the normal number of completed work units (3)? It seems like a bit of a waste: with 3 results already finished, why are two more needed? Also, the "max # of error/total/success results" seems really low if you want 3 results back for each model.

I know that reliable volunteer computers for CPDN are scarce, so I don't want to be crunching redundant work.
ID: 33382
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33385 - Posted: 17 Apr 2008, 5:25:55 UTC
Last modified: 17 Apr 2008, 5:27:06 UTC

The same model is sent to several different computers for reliability tests, sometimes to see how a given dataset works on different CPU maths units, and sometimes to ensure that a result is returned at least once (possibly several times) when something interesting has been found and "they" want another look at the results.
Don't sweat it; "they" know what "they're" doing when "they" issue a dataset several times.

And don't forget - a lot of other projects do the same thing, except there it's called a quorum.

"max # of error/total/success results" is there because it's used by projects with quorums. This project ignores it, just as it ignores a lot of other entries. (Remember "Deadline"?)
So you can ignore this part too.


ID: 33385
Iain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 33387 - Posted: 17 Apr 2008, 10:11:32 UTC
Last modified: 17 Apr 2008, 10:18:21 UTC

One thing that has mystified me about the work unit format is the "didn't need" outcome. All the work units I've looked at are issued in a batch immediately: the one DJStarfox mentions has a spread of 1 hour 30 minutes. No single model could have completed in that time, so the expected logic of "this model isn't needed because there are already enough completed models" doesn't apply. So, I've never figured out what the logic of that is. Sometimes half the models in a unit aren't issued.

The three complete models in DJStarfox's work unit are: Intel/Windows x 2, AMD/Linux x 1. If you overlay the temperature results for the two Intel/Windows results, they match; but the AMD/Linux result is slightly different.

My guess at the quality control logic is:

1. Non-matching platforms ==> estimate platform sensitivity

2. Duplicate result on same platform ==> estimate result reliability

3. Triplicate result on same platform ==> exclude unreliable result (excessive overclocking etc.)

4. 4 x result on same platform ==> one of them "wasted"?

However, #4 is very improbable given the high casualty rate, and there may be other variables of interest (e.g. how do other parameters, such as RAM, affect probability of completion - which isn't scientifically interesting, but might help set minimum PC specifications).

The quality controls in #1-3 are very important to the project, as I suspect CPDN has to answer questions about this quite a lot (from people with supercomputers). They published a very readable paper about it: Knight et al.
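
To make the guess above concrete, here is a minimal Python sketch. It only illustrates the four cases in the list; the function name `quality_checks` and the platform labels are made up for the example, and nothing here is taken from CPDN's actual quality-control code.

```python
from collections import Counter


def quality_checks(platforms):
    """Map the platforms of a work unit's completed results to the
    checks guessed at in the list above (hypothetical helper)."""
    counts = Counter(platforms)
    checks = []
    if len(counts) > 1:
        # Case 1: results from non-matching platforms
        checks.append("estimate platform sensitivity")
    most = max(counts.values(), default=0)
    if most >= 2:
        # Case 2: duplicate result on the same platform
        checks.append("estimate result reliability")
    if most >= 3:
        # Case 3: triplicate result on the same platform
        checks.append("exclude unreliable result")
    if most >= 4:
        # Case 4: a fourth matching result adds little
        checks.append("one result possibly wasted")
    return checks


# The work unit discussed in this thread: Intel/Windows x 2, AMD/Linux x 1
print(quality_checks(["Intel/Windows", "Intel/Windows", "AMD/Linux"]))
```

For the three completed models in this work unit, the sketch reports cases 1 and 2 only.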
ID: 33387
old_user5994

Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 33391 - Posted: 17 Apr 2008, 12:24:55 UTC - in response to Message 33382.  

I know that reliable volunteer computers for CPDN are scarce, so I don't want to be crunching redundant work.


A *LOT* of science is about doing the same experiment over and over again.

The reason we don't have "Cold Fusion" powered cars is quite simply that no one could duplicate the experimental results. Thus, Cold Fusion by that process is not possible.

Other reasons are the ones provided by Les and Iain ...
ID: 33391
Iain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 33396 - Posted: 17 Apr 2008, 17:06:47 UTC - in response to Message 33387.  
Last modified: 17 Apr 2008, 17:08:14 UTC

One thing that has mystified me about the work unit format is the "didn't need" outcome. All the work units I've looked at are issued in a batch immediately: the one DJStarfox mentions has a spread of 1 hour 30 minutes. No single model could have completed in that time, so the expected logic of "this model isn't needed because there are already enough completed models" doesn't apply. So, I've never figured out what the logic of that is. Sometimes half the models in a unit aren't issued...

Bad form to quote your own posts, but I've now figured it out ...

Results are marked as "didn't need" if any result already issued crashes before all the results in the work unit are issued. This wouldn't happen if the work unit numbers were set appropriately. Until recently, Macs would instantly crash slabs - so if a Mac attempted a slab download, the rest of the work unit would be wrongly marked as "didn't need". The Mac problem has been fixed now.
ID: 33396
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,388,593
RAC: 5,030
Message 33397 - Posted: 17 Apr 2008, 17:57:24 UTC - in response to Message 33396.  

One thing that has mystified me about the work unit format is the "didn't need" outcome. All the work units I've looked at are issued in a batch immediately: the one DJStarfox mentions has a spread of 1 hour 30 minutes. No single model could have completed in that time, so the expected logic of "this model isn't needed because there are already enough completed models" doesn't apply. So, I've never figured out what the logic of that is. Sometimes half the models in a unit aren't issued...

Bad form to quote your own posts, but I've now figured it out ...

Results are marked as "didn't need" if any result already issued crashes before all the results in the work unit are issued. This wouldn't happen if the work unit numbers were set appropriately. Until recently, Macs would instantly crash slabs - so if a Mac attempted a slab download, the rest of the work unit would be wrongly marked as "didn't need". The Mac problem has been fixed now.

I don't think it's exactly that, but close.

My guess is that the "max # of error/total/success results" is coming into play. We intelligent humans can say 'ok, that doesn't apply to CPDN', and ignore it: but BOINC, being a stupid machine, will do as it's told. "maximum number of results, in total, ONE. I've got a result back. I'm done. No need to send out any more."

If the administrators could set max errors, max results, max success, all to be the same as the Initial Replication (in the case of the example at the beginning of this thread, 10 / 10 / 10 / 10), then quirks like this unsent result would be avoided - and there'd be two fewer error messages to explain to new crunchers on the board, as well.
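
Here is a toy model of the behaviour being guessed at, in Python. The three limits mirror the "max # of error/total/success results" line on the workunit page; the class, the function, the "limit reached" comparison and the 1 / 1 / 1 values are all assumptions made for illustration, not BOINC's actual server code or this work unit's real settings.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    max_error: int    # "max # of error results"
    max_total: int    # "max # of total results"
    max_success: int  # "max # of success results"


def unsent_outcome(limits, errors, successes):
    """Decide what happens to results that have not been sent yet.

    Assumption: the server stops issuing work as soon as any limit
    is reached, and labels the unsent results "didn't need".
    """
    total = errors + successes
    if (errors >= limits.max_error
            or total >= limits.max_total
            or successes >= limits.max_success):
        return "didn't need"
    return "send"


# One early crash with limits of 1 / 1 / 1: the rest are never sent.
print(unsent_outcome(Limits(1, 1, 1), errors=1, successes=0))
# The same crash with limits equal to an initial replication of 10.
print(unsent_outcome(Limits(10, 10, 10), errors=1, successes=0))
```

Under these assumptions, a single instant crash (such as the Mac failures mentioned above) is enough to stop the remaining results being issued when the limits are low, but not when they match the initial replication.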
ID: 33397
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33404 - Posted: 18 Apr 2008, 0:47:46 UTC

I wonder whether the fact that some models are called a 'Success' when they were clearly unsuccessful has any effect on what the server does next. Or the fact that some models are labelled 'Done' (which is supposed to mean that no error occurred) when there was clearly an error.

What astonishes me more than anything else is how many people blithely crash hundreds of models without posting to ask for advice. I currently have a list of 7 and will be taking action in agreement with the mods and admins. DJStarfox, it must be very rare for CPDN to have 3 completed models from the same workunit to compare.
ID: 33404
old_user5994

Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 33405 - Posted: 18 Apr 2008, 1:48:54 UTC

It is not really the fault of the participants to not know. If you look at the "simple" interface, how would they know?

But, even with the more complex interface, it still takes some training and time before you know what is going on ...

Even though I was pretty competent with BOINC two years ago, there is a surprisingly large amount of information that we take for granted that we will understand ...

Though Rummy took a lot of flak for his statement, it is true that there are the known knowns, the unknown knowns, the known unknowns and the unknown unknowns ...

The last one is the killer: most people don't see or know (or maybe even care) how different CPDN is from other BOINC projects. Though I have to say that with the LARGE field out there it is getting to be a little less alone in the strangeness of the way it does business ...
ID: 33405
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33406 - Posted: 18 Apr 2008, 1:57:18 UTC

And the original poster has been noticeably conspicuous by his absence since his post.

ID: 33406
old_user5994

Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 33407 - Posted: 18 Apr 2008, 4:21:56 UTC - in response to Message 33406.  

And the original poster has been noticeably conspicuous by his absence since his post.


So, you don't like talking to me ... <sob> ... :)
ID: 33407
old_user170894

Joined: 3 Mar 06
Posts: 96
Credit: 353,185
RAC: 0
Message 33408 - Posted: 18 Apr 2008, 4:29:05 UTC - in response to Message 33405.  

It is not really the fault of the participants to not know. If you look at the "simple" interface, how would they know?

But, even with the more complex interface, it still takes some training and time before you know what is going on ...


It's become far too complicated for the average newbie. If they just leap in with no training they drown. And I'm afraid they need more than just a Wiki and forums to refer to when they start to sink. They need to ease into it slowly, progress in steps, and get hands-on experience at each step before progressing to the next.

After they run the installer and before they see the list of projects that accompanies the "You are not attached to any projects" message, they should be directed to Project Interactive Boot Camp where they progress through levels of interactive training and finally earn their wings. It could be something like this...

Level 1 at Project Boot Camp feeds them a few 2 to 4 minute perfect work units so they can see how things are supposed to work. A slide presentation introduces them to basic DC concepts, the basic controls in the manager and what the various WU statuses mean. Otherwise they have to guess and experiment and that can be frustrating if they are not the intuitive type. After the WUs upload and report, they are taken on a tour of Project Boot Camp's website and their results page. The quorum, validation and pending credits are explained.

Level 2 throws them a few curves. They get a few short WUs that are less than ideal. The progress bar doesn't move, the time to completion increments rather than decrements, the CPU time stays at 0 or jumps erratically. One or more WUs crash. Again, after the results have uploaded and reported they are taken to their results page and new concepts are introduced and discussed.


ID: 33408
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 33409 - Posted: 18 Apr 2008, 4:55:24 UTC

I've long felt that this is how it should be done.
But I doubt that it will happen.

ID: 33409
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33411 - Posted: 18 Apr 2008, 11:03:42 UTC
Last modified: 18 Apr 2008, 11:21:57 UTC

Yes, an initial BOINC training exercise would be an excellent idea. Has this ever been suggested in a Trac ticket?

I agree that the whole business of running BOINC + project(s) is complicated. Once you know your way around it mostly seems intuitively easy, but to paraphrase what Dr David Anderson said somewhere, BOINC is almost like an operating system in itself. And that's before considering the extra layers of projects and their tasks added on top of BOINC.

The vocabulary still makes things unnecessarily difficult to understand. In my own current BOINC manager messages I still have to make a mental effort to understand what the BOINC client is. In line 4 the data directory must seem like a foreign language to many people. The computer is still being referred to as the host, whereas I thought that following the Trac ticket I contributed to about BOINC vocabulary, the computer would now always be referred to as the computer. Network activity might be more easily understood if called 'internet activity'. I don't think many newbies can have a clear idea of what or where the scheduler is.

'Requesting 0 seconds of work, reporting 0 completed tasks' sounds as if something may have gone wrong. I think the phrase BOINC previously used, 'Not requesting new...' etc, was probably better.

It's not at all clear that clicking X in the BOINC manager GUI is not the same as exiting from BOINC.

Tasks are now being consistently called tasks, not results, which is an improvement.

The earlier Trac ticket on BOINC nomenclature met with a positive response from Berkeley, so maybe I'll open a new one with a few more suggestions. It might be a good idea for me first to open a thread about it on this forum to allow my suggestions to be picked to pieces here first, and to let other members add extra suggestions. Best to do this soon so that any improvements can be included in version 6.

I'm aware that the vocabulary is just part of the overall problem.

ID: 33411
old_user5994

Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 33415 - Posted: 18 Apr 2008, 15:11:30 UTC - in response to Message 33411.  

I'm aware that the vocabulary is just part of the overall problem.

In BOINC beta I argued for "BOINC Console" and "Plug-in" as the names for the BOINC Manager and the Science Applications, because we could then use the game console that people already know as a mental model...

I was shot down because, as I have also long argued, the geeks in us cannot seem to grasp the fact that not everyone is as fascinated with this stuff as we might be ...
ID: 33415
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33419 - Posted: 18 Apr 2008, 18:18:12 UTC

I hope nobody minds that I've moved John Hopkinson's post from this thread to a thread of its own here so his problem gets full attention. John will receive an email to inform him.
ID: 33419
DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 33433 - Posted: 19 Apr 2008, 0:06:06 UTC - in response to Message 33406.  

And the original poster has been noticeably conspicuous by his absence since his post.


My silence is not an indication of absence.

In response to Iain, my machine was the AMD/Linux computer that did the WU (linked in the original post). I've known for a while that AMD & Intel have slightly different floating point designs on their chips, so that may account for a variation.
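
A small Python illustration of why bit-identical results are hard to guarantee. It does not reproduce any AMD/Intel difference; it only shows that floating-point addition is not associative, so the same sum evaluated in a different order (as different compilers or maths libraries may do) can differ in the last digit. Whether this is what separates the results in this work unit is an assumption.

```python
# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)      # False on IEEE 754 doubles
print(abs(left_to_right - right_to_left))  # a difference in the last bit
```

In a long model run, tiny differences like this can be fed back into the calculation millions of times.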

My main points were to point out the "didn't need" result and the fact that 3 computers returned successful results for this WU, which seemed quite rare (above normal) for most CPDN WUs. Mo.V pointed this out, thanks. A little validation is always a good thing in science, but I thought having 3 results was an unintended number to reach.

I was very curious why the server realized it didn't need the extra result, even before any were finished. Of the few tasks my computer has finished, most of the time I returned the only successful result. Is this still usable to the scientists?

I can see that this inevitably brought up the discussion of how many tasks are aborted or never finish once users see how long they are taking to run. I wish there were an easy solution, but there is definitely not. I think one of the inherent risks of distributed computing--using computers over the internet--is that the network (and the nodes) are unreliable. BOINC was programmed to deal with that, but CPDN will have to work extra hard because of the steep computing requirements of the project. As BOINC improves, all projects should benefit.
ID: 33433
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33435 - Posted: 19 Apr 2008, 1:23:18 UTC

Hi again DJ

There's never been any indication from Oxford that a single completed task from a workunit can't be used. Quite the contrary, as otherwise the same WUs would have to be sent out again. They may well be resent if none complete.

AFAIK all completed tasks (regardless of how many tasks in the WU completed) are subjected to quality control. I think the main aim of this is to root out models that show instabilities in the processing, which in most cases will be caused by overzealous overclocking. Back in the days of CPDN Classic, the quality control failure rate was I think 2-3%.

Even if two or three models from a WU complete, I wouldn't be surprised if they're all used at least for some purposes. Some of the research is done to improve future modelling design (not just to predict what the climate will be like), so the range of what the researchers need must be pretty wide.

The checking methods used by some other projects where they have to find which computers got the sums right and which got them wrong - canonical results and so on - just don't apply to climate models because of the inherent variabilities/uncertainties that increase over time.
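
The growing uncertainty can be illustrated with a toy chaotic system. The Python sketch below uses the logistic map, not a climate model, and the starting values are arbitrary; it only shows how two runs that start almost identically drift apart, which is the kind of behaviour that makes a bit-for-bit comparison against a canonical result unhelpful.

```python
def logistic_run(x, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x) and return the path."""
    path = [x]
    for _ in range(steps):
        x = r * x * (1.0 - x)
        path.append(x)
    return path


# Two runs whose starting points differ by about one part in a trillion.
run_a = logistic_run(0.4, 100)
run_b = logistic_run(0.4 + 1e-12, 100)
gaps = [abs(a - b) for a, b in zip(run_a, run_b)]

print(gaps[0])    # tiny initial difference
print(max(gaps))  # the largest gap reached during the run
```

Both runs are equally "correct"; they simply started from very slightly different states.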
ID: 33435