DAMN. Another 'Unresolvable Error' at 80%!!!

Author	Message
Pete McCann Joined: 8 Mar 06 Posts: 24 Credit: 12,791,616 RAC: 0	Message 28989 - Posted: 26 May 2007, 11:25:06 UTC Well, it is certainly not third time lucky. I've just uploaded another model failure. Damn and Blast!!! It is on the same 4 core machine as last time. Computer ID 532553. Is it terminal, Doc? At least I did get a BBC model to completion yesterday, and its pair should complete today, so it is not all doom and gloom. Let me know about this one, as my last backup is about a week old. I'm running a bit behind my normal regime. Cheers guys. Pete McCann.

Pete McCann Joined: 8 Mar 06 Posts: 24 Credit: 12,791,616 RAC: 0	Message 28991 - Posted: 26 May 2007, 11:37:49 UTC - in response to Message 28989. I've just tracked down the right page for this model. Looks like it's 'Negative pressure' again. Boo Hiss. Did the model manage to make it to 2050? It will be fairly close. Are my other 2 models from this batch also likely to crash? All 4 were downloaded at the same time. Do I have a bad batch? Pete

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 28998 - Posted: 26 May 2007, 17:40:12 UTC Commiserations, Peter. I seem to remember that a BBC member restored a backup of a model that had crashed with this error (they're the same type of model). On the restored rerun, the model got through and continued successfully. This must mean that the error can occasionally be generated by a calculation/processing glitch on the computer. If you have a backup, you might like to try restoring it. The trouble is that you have to restore all the models running on the machine, so the restore causes them all to repeat some computing time. Restoring a single model on a multi-core computer can be done, but the procedure is said to be a hassle, big-time.

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 29007 - Posted: 27 May 2007, 9:54:15 UTC Astro found the odds of getting it going again after a NEGATIVE THETA/PRESSURE are low (it will have already retried 3 times in any case, i.e., the day/month/year restart). The only case where it would work is if it was due to a computer glitch before the start of the current model year. I'm a volunteer and my views are my own.

Pete McCann Joined: 8 Mar 06 Posts: 24 Credit: 12,791,616 RAC: 0	Message 29011 - Posted: 27 May 2007, 11:30:33 UTC - in response to Message 29007. As my backup is about a week old, I have just continued with the two remaining models, hoping that they too don't die on me as well. 2 out of 4 is already unlucky. 4 out of 4 would be a disaster! I have put a copy of the backup to one side, to run it again on a single core machine at some point. Seems a bit of a waste of time to do this now on a quad core machine. Did the one that just failed make it to 2050, by the way? I wasn't sure how to check. On a brighter note, I've just got a pair of models on another machine to completion, over on the BBC side. Pete

MikeMarsUK Volunteer moderator Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0	Message 29017 - Posted: 27 May 2007, 20:33:06 UTC Just barely to 2050. The link to the result is below; you'll find the graph off the right-hand edge of the screen near the bottom. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6278224

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 29019 - Posted: 27 May 2007, 21:21:50 UTC Last modified: 27 May 2007, 21:23:57 UTC
"Are my other 2 models from this batch also likely to crash? All 4 were downloaded at the same time. Do I have a bad batch?"
"Bad batch" isn't quite the right phrase. All datasets in a batch are similar but slightly different, and are exploring different areas of "climate space". Whether any or all of them will fail to complete depends on the exact combination of values for all of the starting parameters. If it were possible to know before a model completes what the outcome would be, then it wouldn't be necessary to run the models in the first place.
Think of it as a ruler, partially positioned over the edge of a table. Now slowly tap the 'on table' end until it falls off. How close can you get the center of the ruler to the edge of the table? This depends on the ruler material being the same mass for the full length, the sides being exactly parallel, how small an amount you can tap it, etc.
The slightly different datasets may ALL result in failure before the full run, or some may make it and others not. But at which 'end' of the values are the short runners, and which the long runners? And at what point do they start failing?
I've always felt that a model that fails should be left that way, so that the researchers can tell that it HAS failed. If the failure was due to an unstable computer, then making it more stable, e.g. by not overclocking as much, is OK, but doing everything possible to make it continue, such as moving it to a different brand of processor with slightly different maths routines, is, I feel, cheating. A bit like starting the Indianapolis 500 in a Ferrari, and finishing it in a Lamborghini. Others feel differently about this.

Pete McCann Joined: 8 Mar 06 Posts: 24 Credit: 12,791,616 RAC: 0	Message 29027 - Posted: 28 May 2007, 10:49:45 UTC - in response to Message 29019.
"If the failure was due to an unstable computer, then making it more stable, e.g. by not overclocking as much, is OK, but doing everything possible to make it continue, such as moving it to a different brand of processor with slightly different maths routines, is, I feel, cheating."
Hi, thanks for the replies. That's good news that the model just made it to 2050. That's another 'completed' one for the headline stats anyway. I'm fairly sure this will not be a computer error. This model was running on a server board with Opterons and registered ECC memory. It is not overclocked at all, so it should be pretty damn stable. I'll probably restore a backup at some point just to make sure it crashes at the same place. What timeslice figure corresponds to model year 2050, or to the point at which a model is deemed completed for the headline stats? Cheers.