climateprediction.net home page
New work Discussion

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63420 - Posted: 28 Jan 2021, 15:46:33 UTC - in response to Message 63418.  

Each model has a lot of files open that need to be saved.
If a model is in the process of checkpointing at the time of a crash, then what's on the disk will be part old save, and part new.
When it tries to start up again, the files don't match, so the model can't start.

The more models that get crammed into the newer computers with their large number of processors, the more likely it will be that there will be constant checkpointing.
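
One way to at least detect the half-old, half-new state described above is to write a manifest of file digests as the last step of each checkpoint and verify it at startup. This is only a sketch under assumptions: the file names, the `MANIFEST.json` name and both helper functions are hypothetical and are not taken from the CPDN or Met Office code.

```python
import hashlib
import json
import tempfile
from pathlib import Path

MANIFEST = "MANIFEST.json"  # hypothetical name, chosen for this sketch


def write_manifest(checkpoint_dir: Path) -> None:
    """Record a SHA-256 digest for every file in the checkpoint directory."""
    digests = {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(checkpoint_dir.iterdir())
        if p.is_file() and p.name != MANIFEST
    }
    (checkpoint_dir / MANIFEST).write_text(json.dumps(digests))


def checkpoint_is_consistent(checkpoint_dir: Path) -> bool:
    """Return True only if every listed file still matches its recorded digest."""
    manifest = checkpoint_dir / MANIFEST
    if not manifest.exists():
        return False
    try:
        digests = json.loads(manifest.read_text())
    except json.JSONDecodeError:
        return False
    for name, expected in digests.items():
        path = checkpoint_dir / name
        if not path.exists():
            return False
        if hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            return False
    return True


def demo() -> tuple:
    """Check a complete checkpoint, then one interrupted partway through."""
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "atmos.dat").write_bytes(b"step 100 atmosphere")
        (d / "ocean.dat").write_bytes(b"step 100 ocean")
        write_manifest(d)
        before = checkpoint_is_consistent(d)
        # Simulate a crash during the next checkpoint: one file is new,
        # the other file and the manifest are still from the old save.
        (d / "atmos.dat").write_bytes(b"step 200 atmosphere")
        after = checkpoint_is_consistent(d)
        return before, after
```

A check like this only reports the mismatch; it does not by itself make the model restartable.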
ID: 63420
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63421 - Posted: 28 Jan 2021, 18:13:07 UTC - in response to Message 63420.  

> Each model has a lot of files open that need to be saved.
> If a model is in the process of checkpointing at the time of a crash, then what's on the disk will be part old save, and part new.
> When it tries to start up again, the files don't match, so the model can't start.
>
> The more models that get crammed into the newer computers with their large number of processors, the more likely it will be that there will be constant checkpointing.

I can see how that would fail, but that's because it's a very bad design. The next checkpoint should be saved to a second file, then the old one deleted, then the new one renamed if necessary. Almost every program does this, including, say, a word processor. When you click save, you can see a temporary file appearing for a fraction of a second. If the power is cut off while that is happening, the original file is not destroyed.
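
The save-to-a-second-file idea can be sketched roughly as below. This is a generic illustration in Python, not how the CPDN models actually save; `save_atomically` and the file name `restart.dat` are made up for the example. It also differs slightly from the order given above: rather than deleting the old file first, it relies on `os.replace` to swap the new file over the old one in a single step, so there is no moment at which neither file exists. The rename is atomic on POSIX systems; behaviour on other platforms and file systems should be checked rather than assumed.

```python
import os
import tempfile
from pathlib import Path


def save_atomically(target: Path, data: bytes) -> None:
    """Write data to a temporary file, then swap it into place in one step."""
    # Create the temp file in the same directory so the rename stays on one file system.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk first
        os.replace(tmp_name, target)  # old file stays intact until this succeeds
    except BaseException:
        # On any failure, remove the temp file and leave the old checkpoint alone.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def demo() -> bytes:
    """Save twice and return what ends up on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "restart.dat"  # hypothetical checkpoint file name
        save_atomically(target, b"checkpoint 1")
        save_atomically(target, b"checkpoint 2")
        return target.read_bytes()
```

This covers a single file only. A model with many open files would still need some way of switching the whole set at once.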
ID: 63421
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63422 - Posted: 28 Jan 2021, 18:27:03 UTC - in response to Message 63419.  

>> Same happened with two on a working machine, which I rebooted cleanly. Should BOINC not gracefully shut down running CPDN tasks itself?
> You are right, BOINC should restart the task from the last checkpoint reached. In the past, my memory is of this being a bigger problem with Linux tasks, but I haven't had a problem with it recently, even when I have updated the Linux kernel, which requires a reboot. My experience a few years ago was that a kernel change combined with a reboot greatly increased the chances of tasks crashing.
>
> To minimise the chances of tasks crashing, I suspend tasks individually, then exit BOINC manager and client before rebooting. On restarting, I resume tasks one at a time, allowing a couple of minutes between resuming individual tasks. I don't know if this makes any difference on the most recent task types, but it used to. I don't know what happens with other projects. For a fair comparison you might need to look at something like LHC@home, which like CPDN has a large number of files open at once, all of which need closing down by BOINC when exiting.
>
> If you reboot without exiting BOINC first, then again in theory tasks should resume from the previous checkpoint, but experience tells me that doing so dramatically increases the chances of failure, though the last time I had a power failure, all tasks survived.
>
> I am not really sure if this is a BOINC issue or a CPDN one, which makes sorting it out difficult.

Surely Windows waits for BOINC to close all files first? In the case of LHC, I get an impatient warning from Windows saying VirtualBox has "active connections". I can click shut down anyway, or cancel. I click cancel, then watch in the task manager until I can see no processing, disk activity, or network activity is happening, then shut down again. Or sometimes I remember to close BOINC first, wait until zero activity, then shut down Windows. Windows still claims VirtualBox has "active connections", which I just ignore. I'm not going to go into VirtualBox itself and mess around to stop stuff. LHC tasks seem to be quite robust though, I've only ever seen them go wrong if the system crashes. And once when a hard disk was failing: although it was not producing errors, it was going very slowly, so the LHC tasks were giving up waiting for disk access and saying "computation error" in BOINC. It took someone in LHC to tell me they'd seen a certain error in the log file so I knew the disk was too old and tired!

I can't have power problems on my main machine as it's protected by a UPS, but I haven't gone to the expense of a larger one for the 6 BOINC-only machines. If the UPS goes onto battery mode, BOINC immediately suspends to make the battery last longer, and the monitors are turned off. If the battery is almost empty, the PC hibernates, so any tasks should resume where they left off. Although I rarely get proper power cuts, I do get the odd fraction-of-a-second power cut. And the voltage varies from 241 to 256V (it should be 230V), so the transformer in the UPS levels that off. More to protect the house lighting actually, since I had a lot of LED lights fail with bad voltages. And because once a 0.5 second power cut caused a bad corruption of the system disk and I lost a few documents. The cheapest thing to do is buy a second-hand UPS with a busted battery, and replace the battery by connecting the UPS to two (if it needs 24V) or more large car batteries. Much cheaper than the sealed ones that come with it, and you get a lot of run time!
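
The staggered suspend-and-resume routine quoted at the top of this post could be scripted along the following lines. It is a sketch, not a tested procedure: it assumes `boinccmd` is on the path and uses its `--task <project URL> <task name> suspend|resume` form. The project URL and task names have to be supplied by the user, and the function names are invented for this example. Whether the two-minute gap still helps on current task types is, as the quoted post says, unknown.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def task_command(project_url: str, task_name: str, action: str) -> List[str]:
    """Build the boinccmd call that suspends or resumes one task."""
    if action not in ("suspend", "resume"):
        raise ValueError("action must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, action]


def staggered(
    project_url: str,
    task_names: Iterable[str],
    action: str,
    gap_seconds: float,
    run: Callable = subprocess.run,
    sleep: Callable = time.sleep,
) -> List[List[str]]:
    """Apply the action to each task in turn, pausing between tasks."""
    names = list(task_names)
    issued = []
    for index, name in enumerate(names):
        command = task_command(project_url, name, action)
        run(command, check=True)  # raises if boinccmd reports a failure
        issued.append(command)
        if index < len(names) - 1:
            sleep(gap_seconds)  # no pause needed after the last task
    return issued
```

Before a reboot this would be called with `"suspend"` and a short gap, followed by exiting the client; after the restart it would be called with `"resume"` and a gap of around 120 seconds.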
ID: 63422
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63424 - Posted: 28 Jan 2021, 22:47:54 UTC - in response to Message 63421.  

Whatever the fine details are, it works.
I tested this a few days ago with a new test model, and it worked perfectly.

So it's your computers and the way that you use them.
And anyone else silently having the same problem.

Just something that will have to be lived with I guess.
ID: 63424
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63427 - Posted: 29 Jan 2021, 11:41:19 UTC - in response to Message 63424.  

> Whatever the fine details are, it works.
> I tested this a few days ago with a new test model, and it worked perfectly.
>
> So it's your computers and the way that you use them.
> And anyone else silently having the same problem.
>
> Just something that will have to be lived with I guess.

I can't think of anything unusual that would cause this. It's either a computer crash or power failure while CPDN is running, or me simply clicking restart in Windows without closing BOINC first. I would imagine most folk experience these things. The chance of it happening is small, and usually only 1 or 2 of the running tasks break, but the chance is larger after a hard crash or power-off than after a plain reboot.

If you deliberately power off a Windows machine to simulate a power cut, for example, I'm sure that after a few times, some of the tasks would cause a computation error.
ID: 63427
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4532
Credit: 18,833,252
RAC: 21,189
Message 63428 - Posted: 29 Jan 2021, 14:44:10 UTC - in response to Message 63427.  

> If you deliberately power off a Windows machine to simulate a power cut, for example, I'm sure that after a few times, some of the tasks would cause a computation error.

I have seen hard resets (a power outage or hitting the power switch) crash tasks on more projects than just CPDN. However, despite not having coded for decades, and never on anything as complex as the climate models or BOINC itself, I still say that it should be possible to write the code so that even then it can resume after a checkpoint. I accept that this might mean more disk usage during computation. The problem is that most of the code used in the executable files for CPDN is over a million lines of Fortran (before compiling) that comes from the Met Office and is used under a license that does not allow the sort of playing with the code that might be needed to resolve the problem. It will be interesting to see what happens with OpenIFS eventually, as that code is open source.
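
To illustrate the trade of extra disk usage for restartability, here is a rough sketch that keeps two generations of checkpoint and only trusts a generation whose completion marker was written last. Nothing here comes from the Met Office or CPDN code: the directory layout, the `COMPLETE` marker and the function names are all assumptions made for the example. A real implementation would also have to flush files to disk before writing the marker.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Optional

COMPLETE_MARKER = "COMPLETE"  # hypothetical marker file, written last


def complete_generations(root: Path) -> list:
    """List generation directories that finished writing, oldest first."""
    return sorted(g for g in root.glob("gen_*") if (g / COMPLETE_MARKER).exists())


def write_generation(root: Path, step: int, files: Dict[str, bytes], keep: int = 2) -> Path:
    """Write a full checkpoint into its own directory, then prune old ones."""
    gen = root / f"gen_{step:08d}"
    gen.mkdir()
    for name, data in files.items():
        (gen / name).write_bytes(data)
    (gen / COMPLETE_MARKER).write_text("ok")  # only now is the set usable
    for old in complete_generations(root)[:-keep]:
        shutil.rmtree(old)  # costs disk for 'keep' generations, no more
    return gen


def latest_usable(root: Path) -> Optional[Path]:
    """Pick the newest generation that has its completion marker."""
    done = complete_generations(root)
    return done[-1] if done else None


def demo() -> str:
    """Two good checkpoints, then a crash partway through the third."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_generation(root, 100, {"atmos.dat": b"a100", "ocean.dat": b"o100"})
        write_generation(root, 200, {"atmos.dat": b"a200", "ocean.dat": b"o200"})
        partial = root / "gen_00000300"
        partial.mkdir()
        (partial / "atmos.dat").write_bytes(b"a300")  # crash before the marker
        usable = latest_usable(root)
        return usable.name if usable else ""
```

In the demo the interrupted third checkpoint is ignored and the restart falls back to the second.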
Please do not private message myself or other moderators for help. This limits the number of people who are able to help and deprives others who may benefit from the answer.
ID: 63428
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63429 - Posted: 29 Jan 2021, 18:08:58 UTC - in response to Message 63428.  

>> If you deliberately power off a Windows machine to simulate a power cut, for example, I'm sure that after a few times, some of the tasks would cause a computation error.
>
> I have seen hard resets (a power outage or hitting the power switch) crash tasks on more projects than just CPDN. However, despite not having coded for decades, and never on anything as complex as the climate models or BOINC itself, I still say that it should be possible to write the code so that even then it can resume after a checkpoint. I accept that this might mean more disk usage during computation. The problem is that most of the code used in the executable files for CPDN is over a million lines of Fortran (before compiling) that comes from the Met Office and is used under a license that does not allow the sort of playing with the code that might be needed to resolve the problem. It will be interesting to see what happens with OpenIFS eventually, as that code is open source.

A million! Ok, we should be happy it works at all!
ID: 63429
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4532
Credit: 18,833,252
RAC: 21,189
Message 63430 - Posted: 29 Jan 2021, 19:41:50 UTC - in response to Message 63429.  

> A million! Ok, we should be happy it works at all!

Yes, when thinking about what we would like the code to be like, I am reminded of the story of someone in a remote rural village asking for directions. The local thinks for a minute before replying,

"Well if I wanted to go there, I wouldn't want to be starting from here."
ID: 63430
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63431 - Posted: 29 Jan 2021, 22:19:13 UTC - in response to Message 63430.  
Last modified: 29 Jan 2021, 22:19:41 UTC

>> A million! Ok, we should be happy it works at all!
>
> Yes, when thinking about what we would like the code to be like, I am reminded of the story of someone in a remote rural village asking for directions. The local thinks for a minute before replying,
>
> "Well if I wanted to go there, I wouldn't want to be starting from here."

So in that story, I'm the innocent person asking for directions, and the programmers are the ones giving deliberately obtuse answers to evade the problem? :-P

And is that from a Monty Python sketch? I've heard it before somewhere.
ID: 63431
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 63432 - Posted: 30 Jan 2021, 1:12:47 UTC - in response to Message 63431.  

The programmers are in the UK Met Office, who aren't involved in this project.
So you're just making things up.

And no, the quote is Not Monty Python. It's much older than that.
ID: 63432
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4532
Credit: 18,833,252
RAC: 21,189
Message 63433 - Posted: 30 Jan 2021, 7:45:57 UTC - in response to Message 63432.  

> And no, the quote is Not Monty Python. It's much older than that.

The earliest reference I found to it in print, from a cursory web search, was 1924, but it may be quite a bit older even than that.
ID: 63433
KAMasud
Joined: 6 Oct 06
Posts: 204
Credit: 7,608,986
RAC: 0
Message 63435 - Posted: 30 Jan 2021, 11:37:13 UTC

Patience, I suppose.
ID: 63435
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63436 - Posted: 30 Jan 2021, 18:30:03 UTC - in response to Message 63432.  
Last modified: 30 Jan 2021, 18:30:42 UTC

> The programmers are in the UK Met Office, who aren't involved in this project.
> So you're just making things up.
>
> And no, the quote is Not Monty Python. It's much older than that.

I'm not making anything up, I didn't say who I was having a go at. But I have no idea why you used the quote.
ID: 63436
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63437 - Posted: 30 Jan 2021, 18:31:19 UTC - in response to Message 63433.  

>> And no, the quote is Not Monty Python. It's much older than that.
> The earliest reference I found to it in print, from a cursory web search, was 1924, but it may be quite a bit older even than that.

I probably heard someone else repeating it, a stand-up comedian or something.
ID: 63437
Helix Von Smelix
Joined: 31 Aug 04
Posts: 7
Credit: 56,478,951
RAC: 9,530
Message 63455 - Posted: 2 Feb 2021, 12:48:41 UTC - in response to Message 63437.  

Thank you guys who overclock or have failed tasks. I am getting tasks with 2 or 3 at the end and I run them okay.
ID: 63455
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63459 - Posted: 2 Feb 2021, 15:17:15 UTC - in response to Message 63455.  

> Thank you guys who overclock or have failed tasks. I am getting tasks with 2 or 3 at the end and I run them okay.

Some might be mine. In which case you owe me a pint.
ID: 63459
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4532
Credit: 18,833,252
RAC: 21,189
Message 63485 - Posted: 3 Feb 2021, 17:29:28 UTC

For Linux users, there may be another new model type on its way: HadSM4. These are similar to HadAM4 but use a slab model of the ocean rather than surface temperatures. The first six-month runs of these seem not to have problems in testing. (Five have completed. My five have about four hours to go.) I would guess there will be an official notice about these closer to them appearing on the main site. The time scale for these is at present anyone's guess but, as Les said in another thread,

"Don't forget to keep breathing."
ID: 63485
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 63556 - Posted: 28 Feb 2021, 18:35:22 UTC

There are more UK Met Office HadCM3 shorts available.
I will leave them for the Windows users.
ID: 63556
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 63557 - Posted: 28 Feb 2021, 18:47:04 UTC - in response to Message 63556.  

> There are more UK Met Office HadCM3 shorts available.
> I will leave them for the Windows users.

Well that didn't last, I just told 6 machines to grab some, and none are left.
ID: 63557
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,763,931
RAC: 4,542
Message 63558 - Posted: 28 Feb 2021, 18:53:07 UTC - in response to Message 63557.  

>> There are more UK Met Office HadCM3 shorts available.
>> I will leave them for the Windows users.
> Well that didn't last, I just told 6 machines to grab some, and none are left.

... I'm not sure there were any, as there haven't been any additions to the work unit list.
ID: 63558