Iceworld (HadSM and HadSM MH) discussion

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 7,032,570
RAC: 4,727
Message 40920 - Posted: 26 Oct 2010, 17:35:02 UTC - in response to Message 40919.  

[Urglab wrote:] Hi, I just noticed one of my tasks turned snowball too. Progress is at 29.43%
... another Windows/Intel machine in the work unit has got further with that model (here), which suggests that your model might be recoverable if you have a backup. If not and the graphics stay blue for a while then the model should be aborted as something has gone seriously wrong.
ID: 40920

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 7,032,570
RAC: 4,727
Message 40921 - Posted: 26 Oct 2010, 17:53:54 UTC - in response to Message 40918.  

[Hans-Henrik Husen wrote:] I'm running the hadsm3fub_jwxd_006453195_1 ... Does anybody have an explanation?
First of all, welcome to the message board - and there is indeed an explanation.

That model has become an 'iceworld', which on a Windows/Intel machine means that it processes very slowly and the temperature graphical display shows all blue. The model will eventually finish (another user in that work unit has finished - after 6,345,735 s [i.e. 73 days!]), but our advice is to abort such models and get on with something a bit more productive.
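The day count quoted above can be checked with a quick sketch. The function and variable names are illustrative only and are not part of any CPDN or BOINC tool:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def seconds_to_days(seconds: float) -> float:
    """Convert an elapsed run time in seconds to days."""
    return seconds / SECONDS_PER_DAY


# Run time reported for the user who finished the iceworld model.
elapsed_days = seconds_to_days(6_345_735)
print(f"{elapsed_days:.1f} days")  # roughly 73.4 days
```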

Ideally, CPDN models could be run without any user intervention but unfortunately the user does sometimes have to get involved. This particular problem affects only models in the HADSM3 family. The other model currently running, FAMOUS, operates at the other extreme: if it finds something wrong it stops, which can be frustrating as well - but at least no time is lost.
ID: 40921

Hans-Henrik Husen
Joined: 7 Sep 09
Posts: 2
Credit: 13,113,974
RAC: 0
Message 40922 - Posted: 26 Oct 2010, 18:17:40 UTC - in response to Message 40921.  

Thank you for your answer! I learned something new today - Iceworld! I'll abort the model (and another Iceworld model I also have running).

Regards

HH Husen
ID: 40922

old_user249784
Joined: 15 Feb 06
Posts: 18
Credit: 131,262
RAC: 0
Message 41061 - Posted: 17 Nov 2010, 11:29:45 UTC

My project hadsm3dhet2_k2xl_006613451 has developed an Iceworld, but it has reached over 97% (237238/259248), so should I let it run to completion?
ID: 41061

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 7,032,570
RAC: 4,727
Message 41062 - Posted: 17 Nov 2010, 11:43:55 UTC - in response to Message 41061.  
Last modified: 17 Nov 2010, 11:44:54 UTC

[old_user249784 wrote:] My project hadsm3dhet2_k2xl_006613451 has developed an Iceworld, but it has reached over 97% (237238/259248), so should I let it run to completion?

It will eventually finish, so you could finish it if you want to; however, it will possibly take a month or so to do it!

Since someone else has already finished that model on Windows/Intel - complete with iceworld - my advice would be to abort it and run a fresh model.
ID: 41062

old_user249784
Joined: 15 Feb 06
Posts: 18
Credit: 131,262
RAC: 0
Message 41063 - Posted: 17 Nov 2010, 12:05:07 UTC - in response to Message 41062.  

OK, thanks! It seems to have run backwards in the last hour!
ID: 41063

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 7,032,570
RAC: 4,727
Message 41064 - Posted: 17 Nov 2010, 12:13:13 UTC - in response to Message 41063.  

[old_user249784 wrote:] OK, thanks! It seems to have run backwards in the last hour!
In that case you must certainly abort it. Some iceworlds get into a looping state in which they endlessly repeat the same "checkpoint" (i.e. up to 144 timesteps). I've not been able to work out a pattern for which processor versions do and don't suffer that fate (an old P4 of mine and a laptop did that) - but stopping the model is the only option.
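A looping model of this kind could in principle be spotted by noting the timestep counter at intervals and checking whether it keeps falling back. The sketch below assumes the readings are written down by hand from the graphics or the BOINC manager; the function name, the threshold, and the sample readings are all hypothetical. A single fall-back is not treated as conclusive, on the assumption that a normal restart can also resume from the last checkpoint:

```python
def looks_like_checkpoint_loop(timesteps, min_regressions=2):
    """Return True if the timestep counter fell back at least min_regressions times.

    `timesteps` is a list of readings in the order they were observed.
    """
    regressions = 0
    for previous, current in zip(timesteps, timesteps[1:]):
        if current < previous:
            regressions += 1
    return regressions >= min_regressions


# Hypothetical readings: the counter keeps returning to the same checkpoint.
stuck = [237_100, 237_238, 237_100, 237_238, 237_100]
# Hypothetical readings from a model that is advancing normally.
healthy = [237_100, 237_244, 237_388]

print(looks_like_checkpoint_loop(stuck))    # True
print(looks_like_checkpoint_loop(healthy))  # False
```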
ID: 41064

Jazzop
Joined: 8 May 05
Posts: 2
Credit: 1,373,627
RAC: 0
Message 41222 - Posted: 4 Dec 2010, 22:20:25 UTC

Please tell me what's up with this one:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6702471

It has been stalled at 83% for several days and the "to completion" time keeps climbing.

I can't obtain any information from the graphic because I can't display graphics for this or any project (due to protected application installation of BOINC?)

It's running under BOINC 6.10.58, 32-bit WinXP, on a dual-boot MacBook (Core2 Duo P7350).
ID: 41222

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41223 - Posted: 4 Dec 2010, 23:28:31 UTC - in response to Message 41222.  

This is the model in question.
There's no error information on that page, and the s/TS looks constant.

The only place that info about the condition of models exists before the final upload is on people's computers, and a lot of it is in the various graphics displays.

So, your guess is as good as ours. :( Sorry.

About the only advice I could give you would be to keep running it until you get bored with the lack of progress, and then abort it.
Aborting will provide visible info on what has gone wrong.


ID: 41223

Jazzop
Joined: 8 May 05
Posts: 2
Credit: 1,373,627
RAC: 0
Message 41225 - Posted: 4 Dec 2010, 23:49:17 UTC - in response to Message 41223.  

[Les Bayliss wrote:] ... keep running it until you get bored with the lack of progress, and then abort it.

Killing it. It ran for 482 hours, which is about 3x what these things normally run. I'm getting rid of this machine next week, so I don't have time to wait for it anymore.
ID: 41225

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41230 - Posted: 5 Dec 2010, 14:26:35 UTC

It definitely was an iceworld in Intel + Windows. Here's an Intel/Win model from the same workunit. Look at the sec/timestep. The slowdown is really much worse than the numbers suggest because we see the cumulative average, not the current speed.
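To see why a cumulative average hides the slowdown, here is a minimal sketch. It assumes the displayed sec/timestep is simply total elapsed seconds divided by timesteps completed, which is an assumption about how the figure is computed. The function name and the two readings are hypothetical and not taken from any real task:

```python
def interval_seconds_per_timestep(ts_start, avg_start, ts_end, avg_end):
    """Recover the average s/TS over the latest interval from two cumulative averages."""
    if ts_end <= ts_start:
        raise ValueError("ts_end must be greater than ts_start")
    total_start = ts_start * avg_start  # elapsed seconds at the first reading
    total_end = ts_end * avg_end        # elapsed seconds at the second reading
    return (total_end - total_start) / (ts_end - ts_start)


# Hypothetical readings: the cumulative average creeps from 2.0 to 2.1 s/TS
# over 1,000 timesteps.
recent = interval_seconds_per_timestep(200_000, 2.0, 201_000, 2.1)
print(f"{recent:.1f} s/TS over the last 1,000 timesteps")
```

With these made-up numbers the cumulative figure rises by only 0.1 s/TS, while the model is actually running at about 22 s/TS over the latest interval.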

So you did the right thing to abort it.

What a pity that more members with an iceworld don't report the problem on the forum. If that member had reported the iceworld in February 2010 you would have received an email warning you about this possibility.

Another member with Intel/Win is still running the model but it's less advanced. She needs an email.
ID: 41230

old_user611146
Joined: 23 Jan 10
Posts: 1
Credit: 3,321,873
RAC: 0
Message 41441 - Posted: 5 Jan 2011, 18:35:10 UTC - in response to Message 39383.  

If you would like another data point for debugging, it looks like I have an ice world in the Slab 6.07 model. It's been running a while, and a week or two ago I noticed that it looked like it was going backwards. Today I realized that at the rate it was going it would never make the deadline, so I decided to look into it. Do you want me to abort it now, or leave it running for a while longer? It is currently suspended.

# A link to the model/ResultID webpage
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10955184

# A current timestep of that model (on the globe graphic)
16/03/1815 05:00

# The s/TS value (on the globe graphic. Remember, you can hit the Z key while viewing the globe and it will give you this additional text/status information.)
70.41

# Whether the temperature display of the globe graphic is blue.
All blue, I would say it's all -42 (hard to tell, those blues are pretty close)

# What your processor/CPU and Operating System is (i.e. Intel or AMD on Windows or Linux)
Dual Nehalem Xeon 2.26GHz quad core with hyperthreading enabled
Windows 7

# Whether you are overclocking.
No overclocking (SuperMicro server motherboard doesn't allow it)
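The fields in a report like this could be collected into a small record and compared against the symptoms described in this thread (all-blue temperature display on an Intel processor under Windows). This is only a sketch: the class, field names, and check are hypothetical, and a match is a hint rather than a diagnosis:

```python
from dataclasses import dataclass


@dataclass
class ModelReport:
    """One user's answers to the report questions (hypothetical structure)."""
    result_url: str
    model_timestep: str
    seconds_per_timestep: float
    display_all_blue: bool
    cpu_vendor: str
    operating_system: str
    overclocked: bool


def matches_thread_symptoms(report: ModelReport) -> bool:
    """True if the report fits the all-blue, Intel + Windows pattern discussed here."""
    return (
        report.display_all_blue
        and report.cpu_vendor.lower() == "intel"
        and report.operating_system.lower().startswith("windows")
    )


# Values copied from the report above.
report = ModelReport(
    result_url="http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10955184",
    model_timestep="16/03/1815 05:00",
    seconds_per_timestep=70.41,
    display_all_blue=True,
    cpu_vendor="Intel",
    operating_system="Windows 7",
    overclocked=False,
)
print(matches_thread_symptoms(report))  # True
```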
ID: 41441

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 7,032,570
RAC: 4,727
Message 41442 - Posted: 5 Jan 2011, 19:10:19 UTC

Thanks for reporting that, Mark. It's definitely an iceworld, as all the other Windows/Intel machines have run into the same problem. That one didn't take long before misbehaving!

The only option for a phase-1 iceworld is to abort it, as it will never recover.

Fortunately, the current FAMOUS and HADAM3P models don't become slow-processing iceworlds.
ID: 41442

old_user603162
Joined: 25 Nov 09
Posts: 1
Credit: 204,092
RAC: 0
Message 41621 - Posted: 10 Feb 2011, 21:26:09 UTC

Hi guys,

So, two possible units that seem to meet these criteria, as follows:

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10079626
Timestep: 154983 of 259248
s/TS: 7.4
Temp: All blue

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=10079752
Timestep: 247718 of 259248
s/TS: 4.92
Temp: All blue

Intel I7 960 3.2GHz quad core - Windows 7

It's been running for over a year and I've only just noticed these two haven't completed yet!
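For a rough sense of scale, the remaining run time can be estimated from the figures above as remaining timesteps multiplied by s/TS. This is a sketch with a hypothetical function name. Because the displayed s/TS is a cumulative average (as noted earlier in the thread), the result is an optimistic lower bound for an iceworld, not a forecast:

```python
SECONDS_PER_DAY = 86_400
TOTAL_TIMESTEPS = 259_248  # total quoted in the posts above


def optimistic_days_remaining(current_timestep, seconds_per_timestep,
                              total_timesteps=TOTAL_TIMESTEPS):
    """Lower-bound estimate of the days left, using the cumulative-average s/TS."""
    remaining_timesteps = total_timesteps - current_timestep
    return remaining_timesteps * seconds_per_timestep / SECONDS_PER_DAY


# (timestep, s/TS) pairs for the two tasks reported above.
for timestep, s_per_ts in [(154_983, 7.4), (247_718, 4.92)]:
    days = optimistic_days_remaining(timestep, s_per_ts)
    print(f"timestep {timestep}: at least {days:.1f} days remaining")
```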
ID: 41621

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 7,032,570
RAC: 4,727
Message 41622 - Posted: 10 Feb 2011, 22:21:23 UTC - in response to Message 41621.  
Last modified: 11 Feb 2011, 14:56:54 UTC

[Ben Smith wrote:] It's been running for over a year and I've only just noticed these two haven't completed yet!
Two finishes and two aborts in those work units after a colossal amount of time. Unfortunately, there's nothing we can do other than advertise the problem here; the slab model has been retired now so it won't be fixed.

Abort them!
ID: 41622

crystalsys
Joined: 18 Feb 05
Posts: 5
Credit: 2,592,605
RAC: 361
Message 41692 - Posted: 4 Mar 2011, 21:37:32 UTC

OK, I'm killing this one

Task 10941637
Name hadsm3dhet2_jjad_006587991_2
Workunit 6791364
Computer ID 1047305

I'm not sure exactly when it went bad, but I started trying to figure out what was going on when the database work started and I couldn't access the forums - definitely an iceball.
ID: 41692

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41704 - Posted: 5 Mar 2011, 18:24:27 UTC

Hi Crystalsys

You definitely did the right thing to abort that iceworld. Here it is. You can see from the seconds per timestep column on the right that the slowdown started over a month ago. With this type of model this problem doesn't correct itself.

Other computers with a model from the same workunit are managing to finish it if they have an AMD processor (i.e. not an Intel one like yours) or run Linux or Mac rather than Windows.

However, there's a member with Intel + Win like you who's hit the same 'iceworld' at the same point as you. Here's the model, truly stuck in this loop for ages and probably unnoticed by its owner. I shall ask our new acting sysadmin, Jonathan, if he can send one of the special iceworld emails to this hapless member.

So thanks for informing us, Crystal.
ID: 41704

©2024 climateprediction.net