Open letter to CPDN Management: 2200+ CPU hours discarded?
Author | Message |
---|---|
Joined: 13 Sep 04 Posts: 16 Credit: 1,178,331 RAC: 0 |
This is a follow-up to a thread I created on July 20, named "missing 'final' file". In it I described a looping condition in the model date, error messages in the BOINC manager, and not receiving any apparent credit. Messages in the BOINC manager said "there is no 'finished' file" and I "may need to reset the project". No one on the Help Desk could tell me what a finished file was or whether resetting the project was the only option. Someone said it was a problem in the BOINC core client, not the CPDN software, but I had no way of knowing with what authority he spoke. Someone else said the problem was harmless and rebooting would fix it. I rebooted, but the looping and errors continued, and I finally gave up and reset, throwing away over 1100 CPU hours of work. Now I am faced with the same situation on another model on another machine: the same messages, the same looping, the same lack of credit, and again rebooting did not help. I have 1165 hours invested in my current model. I'm looking at another reset here unless someone at the project can give me clear guidance on what else to do. I think CPDN does important work, but I think this is a huge problem. I cannot help thinking that these long models are not worth the risk of a crash after so much time has been invested. I would like reassurance from the project leadership that you recognize the problem and are taking steps to address it. Regards, Tom Stepka |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Tom, The "Task exited with zero status but no 'finished' file" message has a number of different causes, most of which are harmless and, more rarely, some of which are very bad (i.e., permanent looping). It can be hard to tell them apart, particularly when we can't see the list of uploaded trickles (your computers are hidden). The exit caused by time changes is due to the BOINC client; in my view this is a bug. Exits caused by the system being busy are deliberate (not a bug), and are also triggered by the BOINC client. The permanent looping problem is due to the science app, and has been solved in the most recent version of the code (5.15). What happens here is that the model reaches an 'impossible' climate (for example, negative atmospheric pressure), and reprocesses the last day to see if it was a calculation error. If reprocessing the last day doesn't work, then it reprocesses the last month, and finally it reprocesses the last year. If it still gets a bad climate after that, it is supposed to automatically abort the model, but instead keeps retrying the year. Some models reach an impossible climate either because of a bad calculation earlier in the model's life or because the initial climate parameters don't lead to a viable climate. One of the main aims of the project is to find which sets of parameters are viable and which are not, so models which fail in this way are of much interest to the scientists. Note also that the modelling work done until the point it started looping is not wasted (the yearly, decade and 40-year uploads will have been sent to the servers). It is perfectly fair to say that once looping has started, the CPU time is being wasted. You also quite reasonably ask about the relationship to the project of the various people who responded to you: I'm a fellow participant, like yourself, although I also help out in some of the forums as a moderator. Ditto for Les and astroWX. Keck Komputers is also a fellow participant, and answers a lot of questions in the Q&A board. Tolu is the only person who actually works for the project (he is one of the two programmers). http://bbc.cpdn.org/forum_thread.php?id=1573&nowrap=true#12452 "Task exited with zero status but no 'finished' file" -- (Edit: I've duplicated some of Les's answers in subsequent edits) I'm a volunteer and my views are my own. News and Announcements and FAQ |
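The day, month, year escalation described in the post above can be sketched roughly as follows. This is only an illustration of the described behaviour, not CPDN source code: the function `recover`, the `rerun` callback, and the `abort_after_year` and `max_year_retries` parameters are all hypothetical names, and the retry cap exists only so the sketch terminates.

```python
from typing import Callable, List

# Hypothetical stage names; the real science app is not structured this way
# as far as this thread tells us.
STAGES = ["day", "month", "year"]


def recover(rerun: Callable[[str], bool],
            abort_after_year: bool = True,
            max_year_retries: int = 3) -> List[str]:
    """Return a log of the actions taken after an 'impossible' climate.

    rerun(stage) stands in for reprocessing the last day, month or year and
    returns True if the climate is plausible again afterwards.
    """
    log: List[str] = []
    for stage in STAGES:
        log.append(f"reprocess {stage}")
        if rerun(stage):
            log.append("continue")
            return log
    if abort_after_year:
        # Behaviour the thread attributes to version 5.15: give up cleanly.
        log.append("abort")
        return log
    # Behaviour the thread attributes to older versions: keep retrying the
    # year. The real loop is unbounded; the cap is an assumption of this sketch.
    for _ in range(max_year_retries):
        log.append("reprocess year")
        if rerun("year"):
            log.append("continue")
            return log
    log.append("still looping")
    return log
```

Calling `recover(lambda stage: False)` models a run whose parameters never yield a viable climate, which ends in an abort; passing `abort_after_year=False` shows the repeated year that participants see as looping.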
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Tom, in your previous thread you had a reply from Tolu. Tolu is one of the two programmers on the project, so you've already had reassurances from part of the project leadership. Hiding your computers makes it harder for those of us who help out here to help you with problems. It may cover up any work computers so that your boss won't recognize them, but there isn't any sensitive information that other crunchers can see. Just click on my name to the left, and compare what you see with what you see of your own account. They're different. |
Joined: 13 Sep 04 Posts: 16 Credit: 1,178,331 RAC: 0 |
Hi, and thanks for your considered reply. Here's my current situation. I have resumed calculation on my model. I am running hadcm3lb_5.08, and will continue to monitor progress and model date. I checked my Windows time update option: internet updates were enabled. I did an "update now" using two different web sites, and each returned a generic error message saying that the sync did not occur. I have now turned off automatic time synching. Meanwhile there were no messages in the BOINC manager window. I have modified my preferences to unhide my computers for CPDN. Please let me know if you can see them. I had no idea they had been invisible to project workers. The computer ID in question is 413075. I see from my account page that the last trickle was July 18 and that I have received ~7700 credits. I take it this means I have not received any credit for the past 30 days. This computer does nothing but run CPDN and SETI apps. The hardware is maybe three years old and it runs a fully updated Windows XP Pro. You say, "If it keeps happening, and your model does not progress, then make a note of the 'model date' at which it crashes. If it repeatedly crashes on the same model date, then you may have to call it a day, and abort the model." Is the model's crash date the point at which it cycles backwards? Short of watching the graphic window 24/7, is there a way to determine this time after the fact? Is there any other information I can supply you? Regards, Tom Stepka |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Hi Tom, If you observe the model date go backwards more than 3 times within the same year, then you can say for certain that it's looping. From your description, it does sound very much like it is. The timestep of the last trickle (the one on the 18th) indicates that the model was at 1955 at that point; if it had been running normally (not suspended), it should be many years past that point by now. As you can see from the graphs, http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=5407433 the servers have been updated with the model's progress from its start to 1955, so the processing work until that point has been worthwhile to the project, although it's fair to say that if it's been looping, the time since then hasn't done much good. If our fears are realised and you do see it going back and forth over the same year, then the best thing to do is to abort it, and a new model will replace it. The new model should be version 5.15 (although there are still some old ones kicking around), which does not suffer from the looping problem (it'll abort itself instead in the same circumstance). If you would like some background information on why some models are unstable, then take a look at the following links: http://www.climateprediction.net/science/strategy.php#param and some presentations from: http://www.climateprediction.net/project/OpenDay2006.php Basically the scientists are searching for 'plausible' models, and the way they do this is to try lots of different combinations and see which ones look reasonable and which ones are unreasonable. Incidentally, the temperature rise you can see in the graph hints towards the model being unstable: a quick rise followed by a levelling out is OK, but a sustained rise indicates that the model parameters may be unbalanced. Having said that, I've seen graphs which look wild but ran to completion (for example http://climateapps2.oucs.ox.ac.uk/cpdnboinc/proj_stat.php?app_index=3), so the rise isn't proof in itself. Cheers, Mike I'm a volunteer and my views are my own. |
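The rule of thumb in the post above (the model date going backwards more than 3 times within the same year) can be written down as a small check. This is a hedged sketch only: the `(year, month, day)` tuples are assumed to be model dates noted by hand from the graphics window, and counting each backward jump against the year the model had reached just before the jump is an interpretation of "within the same year", not something the thread spells out.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

ModelDate = Tuple[int, int, int]  # (year, month, day) as read off the display


def is_looping(observed: Iterable[ModelDate], threshold: int = 3) -> bool:
    """Return True if the model date went backwards more than `threshold`
    times while the model was in the same model year."""
    backward_jumps: Counter = Counter()
    previous: Optional[ModelDate] = None
    for current in observed:
        # Tuples compare element by element, so this is a date comparison.
        if previous is not None and current < previous:
            backward_jumps[previous[0]] += 1
        previous = current
    return any(count > threshold for count in backward_jumps.values())
```

A single backward jump followed by steady progress is the harmless case where a one-off calculation error was reprocessed successfully, so it stays below the threshold.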
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Actually, checking your model errors hasn't helped much at all. But as for the looping, if you look through the messages (full archive in stdoutdae.txt in the BOINC folder), you can get an idea of how long a model year takes by checking the hours/days between trickles. Then you'll just need to check the current month and year in the globe info a few times at intervals shorter than this, and you'll soon see if the same month / year keeps showing up. It SHOULD auto-abort when it reaches the same 'certain point' the second time through; but if it jumps back to the start of the same year soon after this point, then it's looping. If the problem it's re-testing was just caused by a hardware / software hiccup, then the next time it reaches that point, it should get past it and continue on into the future. |
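The arithmetic suggested above, estimating how long one model year takes from the gaps between trickles and then checking the model date more often than that, might look like the sketch below. It assumes the trickle times have been copied by hand from the message log (no log format is assumed or parsed), and the function names and the `fraction` parameter are made up for illustration.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List


def model_year_duration(trickle_times: List[datetime]) -> timedelta:
    """Estimate the wall-clock time per model year as the median gap
    between consecutive trickles."""
    if len(trickle_times) < 2:
        raise ValueError("need at least two trickle timestamps")
    ordered = sorted(trickle_times)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return timedelta(seconds=median(gaps))


def check_interval(trickle_times: List[datetime],
                   fraction: float = 0.25) -> timedelta:
    """Suggest how often to look at the model date: a fraction of the
    estimated model-year duration, so a repeated year is not missed."""
    return model_year_duration(trickle_times) * fraction
```

The median is used rather than the mean so that one long gap, for example while the machine was switched off, does not inflate the estimate.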
Joined: 13 Sep 04 Posts: 16 Credit: 1,178,331 RAC: 0 |
It is nice to know that not all my work on the two problem models was in vain. Thank you for the explanations of what was going on and how to get at that information. I have aborted my model and will attempt to get another, but if it is not version 5.15 then I will abort it immediately and keep trying. I have had my fill of 5.08. Once again, thanks for your help! Regards, Tom |
Joined: 5 Aug 04 Posts: 907 Credit: 299,864 RAC: 0 |
Sorry about the problems you've had. Any work you have done past 10 years is useful to us, so it's not all "down the drain." The period up to the year 2000 is actually the most important/exciting to scientists, to see how certain param combinations track historically. Any new model you get should be version 5.15. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Anyone with both an Athlon and a Pentium and a backup from before the looping started might like to see how Pete B got his model through the loop: http://www.climateprediction.net/board/viewtopic.php?t=5575&highlight= |