Possibly lost task result and other problems

EveningStarNM Joined: 27 Oct 13 Posts: 14 Credit: 29,828 RAC: 0
Message 50166 - Posted: 14 Sep 2014, 16:19:57 UTC

I wanted to find out if the computer time I have given to CPDN was worth anything, so I spent some time examining task results. Notably, of the six tasks I'd downloaded recently, four failed with "Error while computing" (17020157, 17020615, 17020630, and 17020637). Since others have said those applications frequently fail, I've disabled them in my preferences, wondering why CPDN is wasting time distributing tasks that are unsuitable for so many BOINC volunteers.

I also discovered that one of my tasks from last March, 16161297, has status "Timed out - no response". However, that task is listed among my "Credit Pending" tasks (yes, I know that doesn't mean there's any actual credit pending), which makes it seem like it was completed. Or maybe not? Did the two of us who returned trickles return any results that are useful in any way? It's interesting that it's a 3m8m Coupled Model Full Resolution Ocean model that didn't fail immediately. In fact, of the five systems that have tried to run that WU, two returned trickles, although both did suffer ultimate failures. Unfortunately, Stderr reports don't appear to be returned with the trickles, so it's difficult to know what happened on those machines (one of which was mine, but I can't find the log for that WU from so long ago). I'm hoping that something might be revealed in the trickle that will explain why the WU did not fail immediately on two Windows 7 machines, perhaps providing a clue to the CPDN programmers about how to get more useful results from this project.

But I was most interested in seeing if one of my current tasks that appeared to run to completion, 17020616, was a useful expense of 30 hours of CPU time. Looking at the report page, I'm not sure.
There are several issues:

1) Only one "trickle" was returned, at timestep 25,920, CPU time 60,604, even though the total CPU time for the task was only 50,955.87 secs.
2) The "Validate state" for the task is perpetually "initial".
3) No credit was claimed or granted.
4) The "Outcome" was "Success", the "Client state" was "done", and the "Exit status" was 0x0.
5) There is NOTHING in the Stderr log except a long, looooonnnng, series of "Suspended CPDN Monitor - Suspend request from BOINC..." messages.

Have the results from that effort been lost?

I'm still running an ANZ task, and it's got about 161 hours left to run. I'm considering aborting it. But before I do, I want to know if /anything/ I can do for CPDN is actually useful, or if I'm just wasting my time. So far, I am not encouraged by the results of my efforts.

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 6,981,170 RAC: 3,836
Message 50178 - Posted: 15 Sep 2014, 12:03:27 UTC

The HADCM3S model as currently configured submits two trickles and two Zip files, but only one trickle appears on the Web site. If it's marked as a success it's a success.

CPDN does not validate in the usual BOINC sense - i.e. require tasks to agree with each other. The models are simply too numerically complicated for that.

The nuances of 'claimed' and 'granted' don't apply to CPDN because it doesn't validate. The credits will appear when the credit script is run. The number of credits is being reviewed for that particular model type.

The scientific results do not appear in the stderr log. That log provides information collected from the running of a model on a particular machine and will therefore vary from machine to machine. The log you mention simply reports that the model is constantly being suspended as you use the computer, which is the default BOINC setting.

If you want to see the kind of analysis the project produces then have a look at the project's publication page. The papers there are credible, appropriate and appear in respected journals. That's as good as it can get for a BOINC project.

EveningStarNM Joined: 27 Oct 13 Posts: 14 Credit: 29,828 RAC: 0
Message 50190 - Posted: 15 Sep 2014, 22:14:29 UTC - in response to Message 50178. Last modified: 15 Sep 2014, 22:19:03 UTC

Thank you for your reply, Iain.

"The HADCM3S model as currently configured submits two trickles and two Zip files, but only one trickle appears on the Web site. If it's marked as a success it's a success."

Thanks for that explanation. Do the trickles contain any log information that can be used for debugging? While that would be more useful for the project's programmers, it might also be useful if users can alter their BOINC configurations to accommodate CPDN.

"CPDN does not validate in the usual BOINC sense - i.e. require tasks to agree with each other. The models are simply too numerically complicated for that."

Then I suppose we can expect the validate state to always be "Initial". That's understandable.

"The nuances of 'claimed' and 'granted' don't apply to CPDN because it doesn't validate. The credits will appear when the credit script is run. The number of credits is being reviewed for that particular model type."

Okay. I assume that credit is granted if useful results are returned regardless of the number of credits granted. All I'm really interested in is that the number of credits granted is greater than zero. That's really the only clue we have that what we're doing is beneficial.

"The scientific results do not appear in the stderr log. That log provides information collected from the running of a model on a particular machine and will therefore vary from machine to machine. The log you mention simply reports that the model is constantly being suspended as you use the computer, which is the default BOINC setting."

That's exactly the kind of information that I'm interested in. I don't expect to see scientific results in STDERR, but I do want to know about the system status.
That machine, for instance, does nothing but run BOINC full time, and BOINC is set to always run tasks at 85% of processor capacity even when it's in use. BOINC is its only job. It doesn't even have a keyboard, mouse, or monitor attached to it. It's controlled through an RDP session. (I have too many computers at home, so I figure I might as well put the extras to work for BOINC projects). Tasks should never be suspended, and STDERR doesn't give any clues about why the suspensions were requested or how long the suspensions lasted. I'm a bit puzzled by this. I'll be grateful if you can offer any ideas about why those suspend requests might be made.

Unfortunately, I did not have that machine set to keep tasks in memory when suspended, and I read a comment that suggested that might cause problems when checkpoints are saved. Could that have been the reason?

"If you want to see the kind of analysis the project produces then have a look at the project's publication page. The papers there are credible, appropriate and appear in respected journals. That's as good as it can get for a BOINC project."

I expect that those papers would be even more informative if more useful results were returned. Hopefully, the bugs in the failure-prone applications will be worked out soon.

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 6,981,170 RAC: 3,836
Message 50191 - Posted: 15 Sep 2014, 23:17:36 UTC - in response to Message 50190. Last modified: 15 Sep 2014, 23:19:01 UTC

1. The trickles do not contain any information that is normally visible to users. There have been models that sent temperature and other information in trickles which was then displayed on a chart - which provided both a source of interest and occasional diagnostic information. The current generation of models does not provide such feedback and the trickles therefore provide diagnostic information only through their presence or absence: the HADCM3N model, for example, sometimes crashed at ten year intervals, which was apparent from the trickle record.

2. The number of credits on CPDN is intended to reflect the processing power used to finish the model and the CPDN beta testers run early versions of the model to ensure the differences between model run times don't affect the rate at which credits are awarded. No judgment is made, however, about the scientific merit of each model type: they are all assumed to be equally worthwhile.

3. The contents of the stderr log take a bit of familiarisation. Windows, Mac and Linux will differ; BOINC versions may differ; Intel/AMD processors may differ. Luckily the suspension entries are straightforward to explain. By default BOINC attempts to disrupt the host computer as little as possible by suspending the science application whenever it detects that the host computer is doing something. There is a setting in BOINC Manager that controls the percentage activity at which suspension occurs: that value has to be set to zero to stop the suspensions. Your stderr logs will then stop being swamped by suspension entries.

4. Setting the "keep applications in memory" option is generally recommended, particularly for models that have a long interval between checkpoints (at which the model state is saved).
If the application is kept in memory then it will continue from where it left off after a suspension rather than returning to the checkpoint and loading the model state back from file. Continuing from memory is both quicker and more reliable.

5. CPDN applications have four main components, which go wrong in the following order: (1) the underlying Met Office Hadley model (never that we get to hear about), (2) BOINC (it has been known), (3) the CPDN wrapper for the HAD-series models (quite a lot), and (4) configuration files (a lot - and sometimes catastrophically, such as with the BBC Climate Change Experiment, which had to be restarted from scratch). Fortunately, item #4 is the easiest to fix, once spotted.
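As a rough illustration of points 3 and 4, the same two settings can be expressed as a local preferences override rather than set through BOINC Manager. The Python sketch below only builds the XML text and does not touch any BOINC installation. The element names `suspend_cpu_usage` and `leave_apps_in_memory` are believed to match BOINC's global preferences format, but they should be treated as assumptions and checked against the client documentation; the helper name `build_override` is purely hypothetical.

```python
import xml.etree.ElementTree as ET


def build_override(settings):
    """Build the text of a preferences override from a dict of settings.

    Assumption: the root element is <global_preferences> and each
    setting is a child element whose text is the value.
    """
    root = ET.Element("global_preferences")
    for name, value in settings.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed element names, to be verified against the BOINC documentation:
#   suspend_cpu_usage    - 0 means "never suspend because of other CPU activity"
#   leave_apps_in_memory - 1 means "keep suspended applications in memory"
settings = {
    "suspend_cpu_usage": 0,
    "leave_apps_in_memory": 1,
}

override_xml = build_override(settings)
print(override_xml)
```

The output would still have to be saved as the client's override file and re-read by the client before it had any effect; changing the same options in BOINC Manager achieves the same thing without editing files.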

Richard Haselgrove Joined: 1 Jan 07 Posts: 943 Credit: 34,183,823 RAC: 6,640
Message 50192 - Posted: 15 Sep 2014, 23:25:29 UTC - in response to Message 50190.

"That's exactly the kind of information that I'm interested in. I don't expect to see scientific results in STDERR, but I do want to know about the system status. That machine, for instance, does nothing but run BOINC full time, and BOINC is set to always run tasks at 85% of processor capacity even when it's in use. BOINC is its only job. It doesn't even have a keyboard, mouse, or monitor attached to it. It's controlled through an RDP session. (I have too many computers at home, so I figure I might as well put the extras to work for BOINC projects). Tasks should never be suspended, and STDERR doesn't give any clues about why the suspensions were requested or how long the suspensions lasted. I'm a bit puzzled by this. I'll be grateful if you can offer any ideas about why those suspend requests might be made. Unfortunately, I did not have that machine set to keep tasks in memory when suspended, and I read a comment that suggested that might cause problems when checkpoints are saved. Could that have been the reason?"

BOINC's thermal throttling is really very, very crude. Your "run tasks at 85% of processor capacity" will result in

Run for eight seconds, pause for 1 second
Run for eight seconds, pause for 2 seconds

or something like that. Those will be the model suspensions, and - especially without LAIM ("leave applications in memory") being set - will really hammer your CPDN contribution: the Met Office's Fortran code really isn't designed to be treated like that.

You would be far better off limiting the number of CPU cores that BOINC is allowed to use to something like 75%, but allowing the cores in use to run at 100%.
That will allow the climate models to run much more smoothly, and your operating system will re-distribute the work between the available hardware processing units to alleviate any thermal stress between physical 'cores' that you might be worried about.
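The arithmetic behind that advice can be sketched in a few lines of Python. This is a toy model, not a description of how the BOINC client actually schedules its pauses: the 10-second throttle period, the rounding rule for cores, and the function names are all assumptions chosen only to make the comparison concrete.

```python
import math


def throttle_schedule(cpu_time_pct, period_s=10.0):
    """Split one assumed throttle period into (run_seconds, pause_seconds)."""
    run = period_s * cpu_time_pct / 100.0
    return run, period_s - run


def suspensions_per_hour(cpu_time_pct, period_s=10.0):
    """Count pause/resume cycles per hour under the assumed period."""
    if cpu_time_pct >= 100:
        return 0  # no throttling, so the model is never paused
    return int(3600 // period_s)


def cores_in_use(total_cores, cores_pct):
    """Cores left to BOINC; rounding down with a minimum of one is an assumption."""
    return max(1, math.floor(total_cores * cores_pct / 100.0))


# Throttling every core to 85% of CPU time:
print(throttle_schedule(85))     # (8.5, 1.5)
print(suspensions_per_hour(85))  # 360

# Limiting BOINC to 75% of the cores on a hypothetical 4-core machine,
# with each core in use running at 100%:
print(cores_in_use(4, 75))        # 3
print(suspensions_per_hour(100))  # 0
```

Under these assumptions, the 85% throttle interrupts every running model hundreds of times an hour, while the core limit gives up one core of a four-core machine and leaves the remaining models uninterrupted.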

EveningStarNM Joined: 27 Oct 13 Posts: 14 Credit: 29,828 RAC: 0
Message 50194 - Posted: 16 Sep 2014, 0:34:47 UTC - in response to Message 50192.

"BOINC's thermal throttling is really very, very crude. Your "run tasks at 85% of processor capacity" will result in

Run for eight seconds, pause for 1 second
Run for eight seconds, pause for 2 seconds

or something like that... You would be far better off limiting the number of CPU cores that BOINC is allowed to use to something like 75%, but allowing the cores in use to run at 100%..."

Wow. I definitely misunderstood that "Use at most X% CPU time" setting. I kinda hate taking one of the cores out of the queue, but it seems prudent. Thanks for the information.