Open letter to CPDN Management: 2200+ CPU hours discarded?

Author	Message
old_user17389 Joined: 13 Sep 04 Posts: 16 Credit: 1,178,331 RAC: 0	Message 24028 - Posted: 18 Aug 2006, 23:57:37 UTC
This is a follow-up to a thread I created on July 20, named "missing 'final' file". In it I described a looping condition in the model date, error messages in the BOINC manager, and not receiving any apparent credit. Messages in the BOINC manager said "there is no 'finished' file" and I "may need to reset the project". No one on the Help Desk could tell me what a finished file was or whether resetting the project was the only option. Someone said it was a problem in the BOINC core client, not the CPDN software, but I had no way of knowing with what authority he spoke. Someone else said the problem was harmless and rebooting would fix it. I rebooted but the looping and errors continued, and I finally gave up and reset, throwing away over 1100 CPU hours of work.
Now I am faced with the same situation on another model on another machine. Same messages, same looping, same lack of credit, and again rebooting did not help. I have 1165 hours invested in my current model. I'm looking at another reset here unless someone at the project can give me clear guidance on what else to do.
I think CPDN does important work but I think this is a huge problem. I cannot help thinking that these long models are not worth the risk of a crash after so much time has been invested. I would like reassurances from the project leadership that you recognize the problem and are taking steps to address it. Regards, Tom Stepka
ID: 24028

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 24035 - Posted: 19 Aug 2006, 9:45:12 UTC Last modified: 19 Aug 2006, 10:17:31 UTC
Tom,
The "Task exited with zero status but no 'finished' file" message has a number of different causes, most of which are harmless and, more rarely, some of which are very bad (i.e., permanent looping). It can be hard to tell them apart, particularly when we can't see the list of uploaded trickles (your computers are hidden).
The exit caused by time changes is due to the BOINC client; in my view this is a bug. Exits caused by the system being busy are deliberate (not a bug), and are also triggered by the BOINC client.
The permanent looping problem is due to the science app, and has been solved in the most recent version of the code (5.15). What happens here is that the model reaches an 'impossible' climate (for example, negative atmospheric pressure), and reprocesses the last day to see if it was a calculation error. If reprocessing the last day doesn't work, then it reprocesses the last month, and finally it reprocesses the last year. If it still gets a bad climate after that, it is supposed to automatically abort the model, but instead keeps retrying the year.
A model can reach an impossible climate either because of a bad calculation earlier in the model's life, or because the initial climate parameters don't lead to a viable climate. One of the main aims of the project is to find which sets of parameters are viable and which are not, so models which fail in this way are of much interest to the scientists.
Note also that the modelling work done up to the point where it started looping is not wasted (the yearly, decade and 40-year uploads will have been sent to the servers). It is perfectly fair to say that once looping has started, the CPU time is being wasted.
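The day, month, year escalation described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the names `run_with_retries`, `rerun_ok` and `abort_after_year_retry` are hypothetical and are not taken from the CPDN science app. The flag simply contrasts the behaviour described for 5.15 (abort) with the older behaviour (keep retrying the year).

```python
from typing import Callable, List

# Hypothetical sketch; none of these names come from the CPDN code.
RETRY_SPANS = ["day", "month", "year"]  # escalation order described in the post


def run_with_retries(
    rerun_ok: Callable[[str], bool],
    abort_after_year_retry: bool,
    max_loop_observations: int = 3,
) -> List[str]:
    """Return the actions taken after an 'impossible' climate is detected.

    rerun_ok(span) reports whether reprocessing that span gave a viable climate.
    """
    actions: List[str] = []
    for span in RETRY_SPANS:
        actions.append(f"reprocess last {span}")
        if rerun_ok(span):
            actions.append("continue")
            return actions
    if abort_after_year_retry:
        # Behaviour the post attributes to version 5.15.
        actions.append("abort model")
        return actions
    # Older behaviour: the year is retried again and again.
    # The cap only exists so that this simulation terminates.
    for _ in range(max_loop_observations):
        actions.append("reprocess last year")
    actions.append("still looping")
    return actions
```

Calling `run_with_retries(lambda span: False, abort_after_year_retry=False)` models a parameter set that never yields a viable climate on the older version, which is the case where CPU time is wasted.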
You also quite reasonably ask how the various people who responded to you relate to the project: I'm a fellow participant, like yourself, although I also help out in some of the forums as a moderator. Ditto for Les and astroWX. Keck Komputers is also a fellow participant, and answers a lot of questions in the Q&A board. Tolu is the only person who actually works for the project (he is one of the two programmers).
http://bbc.cpdn.org/forum_thread.php?id=1573&nowrap=true#12452
"Task exited with zero status but no 'finished' file"
Firstly, don't reset the project! These messages are 'usually harmless'. The model should automatically restart based on the last checkpoint (checkpoints are written every 5 model days, about 15 minutes of processing on typical machines).
If this message happens regularly (every 3 or 4 hours), it's probably due to a 'Windows time sync', which seems to cause problems on some machines. To test this, right-click on the clock in the system tray, select 'adjust time', select 'internet time', then 'update now'. If the model immediately falls over and resumes with the 'zero exit' message, you have found the culprit.
ghogan has been kind enough to experiment and discovered that it happens if the clock is set back slightly (setting it forward doesn't have this effect). So if your on-board clock is running slightly fast, you'll see these zero status exits quite frequently, because the time sync will keep setting it back.
You can disable the time sync, or change the frequency at which it happens (it's set to sync once per week on my PC). Disabling the time sync is easy, but changing its frequency involves using RegEdit to modify the system registry, which is not recommended unless you are confident with it. The following key needs to be modified.
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient\SpecialPollInterval (set to 604800, which is one week in seconds)
For more information on this, see the last part of this article.
A similar problem occurs on some nForce2 motherboards, exactly every 5 hours; see the following forum posting: message no finished file.
In other cases, it can be caused by running something which uses a lot of system resources, for example games, image/video manipulation programs, and so forth (on my own PC, VMware and Google Video cause it).
If it keeps happening, and your model does not progress, then make a note of the 'model date' at which it crashes. If it repeatedly crashes on the same model date, then you may have to call it a day, and abort the model.
-- (Edit: I've duplicated some of Les's answers in subsequent edits)
I'm a volunteer and my views are my own. News and Announcements and FAQ
ID: 24035
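As a quick check of the figure quoted above, the following sketch works out the poll interval in seconds and, on Windows only, reads the current value without changing it. It uses Python's standard `winreg` module; the helper names `poll_interval_seconds` and `read_current_interval` are made up for this example, and the sketch assumes the value lives at the key path given in the post.

```python
import sys

SECONDS_PER_DAY = 24 * 60 * 60

# Registry location quoted in the post above (under HKEY_LOCAL_MACHINE).
NTP_CLIENT_KEY = r"SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient"
VALUE_NAME = "SpecialPollInterval"


def poll_interval_seconds(days: int) -> int:
    """Convert a sync period in days to the seconds value stored in the registry."""
    return days * SECONDS_PER_DAY


def read_current_interval():
    """Read the current poll interval; returns None when not running on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, available on Windows only

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NTP_CLIENT_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value
```

`poll_interval_seconds(7)` gives 604800, matching the one-week setting mentioned in the post. The sketch deliberately reads rather than writes, since editing the registry carries the risk the post warns about.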

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 24036 - Posted: 19 Aug 2006, 9:55:53 UTC
Tom,
In your previous thread you had a reply from Tolu. Tolu is one of the two programmers on the project, so you've already had reassurances from part of the project leadership.
Hiding your computers doesn't help those of us who help out here to help you with problems. It may cover up any work computers so that your boss won't recognize them, but there isn't any sensitive information that other crunchers can see. Just click on my name to the left, and compare what you see with what you see of your own account. They're different.
ID: 24036

old_user17389 Joined: 13 Sep 04 Posts: 16 Credit: 1,178,331 RAC: 0	Message 24046 - Posted: 19 Aug 2006, 21:24:55 UTC
Hi, and thanks for your considered reply. Here's my current situation.
I have resumed calculation on my model. I am running hadcm3lb_5.08, and will continue to monitor progress and model date.
I checked my Windows time update option: internet updates were enabled. I did an "update now" using two different web sites and got a generic error message for each saying that the sync did not occur. I have now turned off automatic time syncing. Meanwhile there were no messages in the BOINC manager window.
I have modified my preferences to unhide my computers for CPDN. Please let me know if you can see them. I had no idea they had been invisible to project workers. The computer ID in question is 413075.
I see from my account page that the last trickle was July 18 and that I have received ~7700 credits. I take it this means I have not received any credit for the past 30 days.
This computer does nothing but run CPDN and SETI apps. The hardware is maybe three years old and it runs a fully updated Windows XP Pro.
You say, "If it keeps happening, and your model does not progress, then make a note of the 'model date' at which it crashes. If it repeatedly crashes on the same model date, then you may have to call it a day, and abort the model." Is the model's crash date the point at which it cycles backwards? Short of watching the graphic window 24/7, is there a way to determine this time after the fact?
Is there any other information I can supply you? Regards, Tom Stepka
ID: 24046

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 24047 - Posted: 19 Aug 2006, 22:10:06 UTC Last modified: 19 Aug 2006, 22:14:10 UTC
Hi Tom,
If you observe the model date go backwards more than 3 times within the same year, then you can say for certain that it's looping. From your description, it does sound very much like it is.
The timestep of the last trickle (the one on the 18th) indicates that the model was at 1955 at that point; if it was running normally (not suspended), it should be many years past that point by now. As you can see from the graphs, http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=5407433 the servers have been updated with the model's progress from its start to 1955, so the processing work until that point has been worthwhile to the project, although it's fair to say that if it's been looping, the time since then hasn't done much good.
If our fears are realised and you do see it going back and forth over the same year, then the best thing to do is to abort it, and a new model will replace it. The new model should be version 5.15 (although there are still some old ones kicking around), which does not suffer from the looping problem (it'll abort itself instead in the same circumstance).
If you would like some background information on why some models are unstable, then take a look at the following link: http://www.climateprediction.net/science/strategy.php#param and some presentations from: http://www.climateprediction.net/project/OpenDay2006.php
Basically the scientists are searching for 'plausible' models, and the way they do this is to try lots of different combinations and see which ones look reasonable and which ones are unreasonable.
Incidentally, the temperature rise you can see in the graph hints towards the model being unstable: a quick rise followed by a levelling out is OK, but a sustained rise indicates that the model parameters may be unbalanced.
Having said that, I've seen graphs which look wild but ran to completion (for example http://climateapps2.oucs.ox.ac.uk/cpdnboinc/proj_stat.php?app_index=3), so the rise isn't proof in itself.
-Cheers, Mike
I'm a volunteer and my views are my own. News and Announcements and FAQ
ID: 24047
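The rule of thumb in the post above (the model date going backwards more than 3 times within the same year) can be written down as a small check. This is a sketch under assumptions: the observations are (model year, model month) pairs noted by hand from the graphics window, and the function name `is_looping` is made up for this example. Each backward jump is counted against the model year that was being processed when the jump happened.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

ModelDate = Tuple[int, int]  # (model year, model month), as read off the graphics window


def is_looping(observations: Iterable[ModelDate], threshold: int = 3) -> bool:
    """Return True if the date went backwards more than `threshold` times in one model year."""
    backward_jumps: Counter = Counter()
    previous: Optional[ModelDate] = None
    for current in observations:
        if previous is not None and current < previous:
            # Attribute the jump to the year the model was working on before it fell back.
            backward_jumps[previous[0]] += 1
        previous = current
    return any(count > threshold for count in backward_jumps.values())
```

A single backward step is not flagged, since the posts explain that one retry after a hardware or software hiccup is normal.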

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 24049 - Posted: 19 Aug 2006, 22:25:15 UTC
Actually, checking your model errors hasn't helped much at all.
But as for the looping, if you look through the messages (full archive in stdoutdae.txt in the BOINC folder), you can get an idea of how long a model year is by checking the hours/days between trickles. Then you'll just need to check the current month and year in the globe info a few times at intervals shorter than this, and you'll soon see if the same month/year keeps showing up.
It SHOULD auto-abort when it reaches the same 'certain point' the second time through; but if it jumps back to the start of the same year soon after this point, then it's looping. If the problem it's re-testing was just caused by a hardware/software hiccup, then the next time at that point, it should get past it, and continue on into the future.
ID: 24049
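The estimate suggested above, judging the length of a model year from the gaps between trickles, could be done along these lines. This is a minimal sketch: it assumes the trickle times have already been copied out of the message log as `datetime` values (the layout of stdoutdae.txt is not assumed here), and the function name `typical_trickle_gap` is made up for this example. The median is used so that one unusually long gap, such as a period when the machine was off, does not skew the result.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List


def typical_trickle_gap(trickle_times: List[datetime]) -> timedelta:
    """Return the median wall-clock gap between consecutive trickles."""
    if len(trickle_times) < 2:
        raise ValueError("need at least two trickle timestamps")
    ordered = sorted(trickle_times)
    gaps_seconds = [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return timedelta(seconds=median(gaps_seconds))
```

Checking the model month and year at intervals shorter than the returned gap then shows whether the same date keeps reappearing.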

old_user17389 Joined: 13 Sep 04 Posts: 16 Credit: 1,178,331 RAC: 0	Message 24054 - Posted: 20 Aug 2006, 13:55:36 UTC
It is nice to know that not all my work on the two problem models was in vain. Thank you for the explanations of what was going on and how to get at that information.
I have aborted my model and will attempt to get another. If the new one is not version 5.15, I will abort it immediately and keep trying. I have had my fill of 5.08.
Once again, thanks for your help! Regards, Tom
ID: 24054

old_user1 Joined: 5 Aug 04 Posts: 907 Credit: 299,864 RAC: 0	Message 24061 - Posted: 22 Aug 2006, 15:57:53 UTC - in response to Message 24054.
Sorry about the problems you've had -- any work you have done past 10 years is useful to us, so it's not all "down the drain." The period up to the year 2000 is actually the most important and exciting for scientists, to see how certain parameter combinations track historically. Any new model you get should be version 5.15.
ID: 24061

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 24150 - Posted: 31 Aug 2006, 2:18:31 UTC
Anyone with both an Athlon and a Pentium and a backup from before the looping started might like to see how Pete B got his model through the loop: http://www.climateprediction.net/board/viewtopic.php?t=5575&highlight=
Cpdn news
ID: 24150