Project keeps resetting - any explanations?

Message boards : Number crunching : Project keeps resetting - any explanations?

old_user51877

Joined: 3 Feb 05
Posts: 1
Credit: 28,109
RAC: 0
Message 48753 - Posted: 10 Apr 2014, 10:29:07 UTC

A new project started running on my iMac a few days ago - UK Met Office HADAM3P Australia NZ. It seems to reset itself every few minutes, with elapsed time and % completed returning to 0. Is it meant to do this? I am running Mac OS X 10.9.2.
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4309
Credit: 16,356,117
RAC: 5,224
Message 48755 - Posted: 10 Apr 2014, 10:39:38 UTC - in response to Message 48753.  

No, it isn't meant to do this. Some models get stuck in a loop. Unless one of the moderators or someone with more knowledge than me can think of anything, you will have to abort the task.

If it keeps happening on other tasks then we go back to the drawing board.
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1079
Credit: 6,839,375
RAC: 6,646
Message 48756 - Posted: 10 Apr 2014, 10:42:54 UTC
Last modified: 10 Apr 2014, 10:50:52 UTC

One possibility is suggested by the history recorded for a model run earlier on that machine, hadcm3n_7vsn_1980_40_008452490_2. If you click on the '+' icon next to Stderr then a model log will appear. This shows a large number of entries of the form 'Suspended CPDN Monitor - Suspend request from BOINC...'. These entries occur because the default BOINC settings try to minimise the impact of BOINC on the computer, which is presumably used for something else most of the time. That standard setting does not work very well with the climate models, which are larger than most BOINC models.

What may be happening is that two BOINC settings are interacting badly: the 'suspend when PC busy' setting and the 'leave application in memory' setting. If the application is not left in memory, then each time it is suspended it has to restart from the last save point, and for the ANZ models the save points might be a long time apart (> 10 minutes). So, if the model is suspended more often than it saves, it will not make any progress.
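
The interaction described above can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions: the function name and the numbers are made up for illustration, suspensions are treated as instantaneous, and the only figure taken from this thread is that save points may be more than 10 minutes apart.

```python
def checkpointed_progress(save_interval_min, suspend_every_min, total_min, keep_in_memory):
    """Return the minutes of model progress safely saved after total_min of run time.

    Toy model (an assumption, not how BOINC is implemented): the task saves
    every save_interval_min minutes of progress and is suspended after every
    suspend_every_min minutes of running. If it is not kept in memory, a
    suspension throws away everything since the last save point.
    """
    saved = 0      # progress recorded at the last save point
    current = 0    # progress held in memory
    elapsed = 0    # run time used so far
    while elapsed < total_min:
        run = min(suspend_every_min, total_min - elapsed)
        current += run
        elapsed += run
        # The most recent save point is the last whole multiple of the interval.
        saved = (current // save_interval_min) * save_interval_min
        if elapsed < total_min and not keep_in_memory:
            current = saved  # restart from the last save point
    return saved


# Suspended every 5 minutes, saving every 15 minutes, one hour of run time:
stuck = checkpointed_progress(15, 5, 60, keep_in_memory=False)
fine = checkpointed_progress(15, 5, 60, keep_in_memory=True)
```

With these illustrative numbers the first case never reaches a save point, which matches the symptom of elapsed time returning to 0, while the second case keeps all of its progress.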

If this is indeed the cause, then the solution is twofold: in BOINC Manager, (1) make sure that 'leave applications in memory while suspended' is selected, and (2) make sure that the 'while processor usage is less than' setting is set to zero (which will stop the suspensions). These options are in Tools | Computing preferences: the suspension setting is on the 'processor usage' tab and the memory setting is on the 'disk and memory usage' tab.

If that doesn't work then please post back here: someone else might have a better idea ...
MichaelO

Joined: 8 Aug 05
Posts: 12
Credit: 24,424,627
RAC: 0
Message 49360 - Posted: 14 Jun 2014, 19:39:45 UTC

I just experienced a significant issue on a Windows 7 64-bit machine: my CPDN project seemed to reset or, maybe more accurately, attempted to restart repeatedly. This discussion thread seems to be the closest fit for the issue I encountered.

The problem seems to have been caused by my clearing the Local Preferences settings and restarting BOINC Manager. That was yesterday.

This morning, I found that Windows was warning of: 1) low memory; 2) the BOINC Manager trying multiple times to reconnect to a client (I assume CPDN as SETI was still running); and 3) a message that virtual memory (paging) was low.

The machine was also clearly unstable. When I investigated via Windows Task Manager, physical memory was full, showing only 1 MB free out of the 12 GB installed. Also, under Advanced Settings, virtual memory was fixed at 512 MB, but Windows was recommending 18 GB (maximum; but 16 MB minimum). I believe I have fixed all these issues.

Now, though, BOINC Manager does not list any CPDN tasks. SETI tasks are still present; and the SETI tasks appear to be running okay, too.

BOINC Manager shows that CPDN Disk size is over 74 GB, which makes sense, because I had the Local Preference setting set for at least 10 days of work to be stored.

First, any ideas why changing the Local Preferences might have caused memory to fill up and - apparently - crash CPDN? Do these events appear related?

Second, any suggestions on how to get CPDN tasks restarted? Or, is this a hopeless cause requiring a "project reset?" Again, the Local Preferences have been reset to previous values - and in line with this thread's recommendations, in fact.

Thanks for any help in advance!
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 49361 - Posted: 14 Jun 2014, 20:17:51 UTC - in response to Message 49360.  
Last modified: 14 Jun 2014, 20:32:26 UTC

MichaelO wrote: "BOINC Manager shows that CPDN Disk size is over 74 GB, which makes sense, because I had the Local Preference setting set for at least 10 days of work to be stored."


This sounds complex. Which PC of the 4 that you have listed is the one that has the problem?

There is no way that there should be 74 GB in the CPDN project directory. Looking at all your PCs, none of them have enough tasks "in progress" to have anything like that. It's possible that some crashed tasks have left-over directories that are full of files. Look in the Tasks tab of BOINC Manager and make sure that it is set to Show All Tasks, i.e. if the button says "Show all tasks", click on it. If it says "Show active tasks", leave it as is. Any model directory under the climateprediction.net directory that doesn't correspond to a listed task can be deleted. That should get rid of many GB of space.
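
The check described above can be sketched as a small script that only lists candidate directories and deletes nothing. It assumes, without confirmation from this thread, that a model directory's name is a prefix of (or equal to) the name of the task it belongs to; the function name and arguments are hypothetical, and the task names would have to be copied by hand from the Tasks tab.

```python
from pathlib import Path


def find_leftover_dirs(project_dir, listed_task_names):
    """List subdirectories of project_dir that match no task shown in BOINC Manager.

    Assumption (verify on your own machine): a directory belongs to a task
    when the task name starts with the directory name.
    """
    listed = list(listed_task_names)
    leftovers = []
    for entry in sorted(Path(project_dir).iterdir()):
        if not entry.is_dir():
            continue  # only model directories are of interest here
        in_use = any(task.startswith(entry.name) for task in listed)
        if not in_use:
            leftovers.append(entry)
    return leftovers
```

Printing the returned paths and checking them by eye before deleting anything is the safer way to use this, since the naming assumption may not hold for every model type.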

As for preferences, cpdn seems to work best with the following preferences:
Computing allowed:
While computer in use
Only after computer has been idle 0 minutes
While processor use is less than 0 percent

Use at most 100% CPU Time

Leave applications in memory when suspended
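
For anyone who prefers a file to the dialog, the list above could be expressed as a local preferences override. The sketch below only builds the XML text and writes nothing to disk. The element names are my understanding of BOINC's global_prefs_override.xml format and should be treated as assumptions to check against the BOINC documentation; the dictionary and function names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed element names; verify against the BOINC client documentation.
CPDN_FRIENDLY_PREFS = {
    "run_if_user_active": "1",    # computing allowed while computer in use
    "idle_time_to_run": "0",      # only after computer has been idle 0 minutes
    "suspend_cpu_usage": "0",     # 0 disables the "processor use" suspension
    "cpu_usage_limit": "100",     # use at most 100% CPU time
    "leave_apps_in_memory": "1",  # leave applications in memory when suspended
}


def build_prefs_override(prefs):
    """Return the preferences as an XML string rooted at <global_preferences>."""
    root = ET.Element("global_preferences")
    for name, value in prefs.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


override_xml = build_prefs_override(CPDN_FRIENDLY_PREFS)
```

Setting the same values through Tools | Computing preferences, as described earlier in the thread, achieves the same thing and is the route the posts here actually recommend.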

I'm not sure why you would have a memory problem. Generally cpdn executables are not memory hogs. If you have "Leave applications in memory when suspended" set to yes, and there are numerous projects with numerous tasks being suspended, then I could potentially see an issue.
MichaelO

Joined: 8 Aug 05
Posts: 12
Credit: 24,424,627
RAC: 0
Message 49370 - Posted: 16 Jun 2014, 7:19:13 UTC - in response to Message 49361.  

Hi geophi,

The computer with the issue is ID: 926174. (It has the name: mjo003).

I verified that there are no CPDN tasks listed - active or pending. (I also left the command for Tasks set to "Show active tasks.")

So, to summarize what you are telling me - since there are no active or pending CPDN project tasks, all 74 GB of data in the directory - associated with CPDN - can be cleared out.

Another question comes to mind given what you just had me do -- Could it be that my change in preference settings simply allowed the remaining tasks that were running on this machine to end quickly? I tried to make heads or tails of the recent trickle info, but I am not sure I am interpreting it correctly.

Anyway, the disk value has been typical for this machine for quite some time, which is why I thought it was "normal." So -- I have one last question: Is it easiest to simply try a project "reset" to clean out the directory? I would dislike damaging the directory structure the way I go about wiping folders and disks... Thanks!
Ingleside

Joined: 5 Aug 04
Posts: 108
Credit: 17,923,039
RAC: 36,401
Message 49371 - Posted: 16 Jun 2014, 7:58:28 UTC - in response to Message 49370.  

MichaelO wrote: "Anyway, the disk value has been typical for this machine for quite some time, which is why I thought it was 'normal.' So -- I have one last question: Is it easiest to simply try a project 'reset' to clean out the directory? I would dislike damaging the directory structure the way I go about wiping folders and disks... Thanks!"

Reset should work.
MichaelO

Joined: 8 Aug 05
Posts: 12
Credit: 24,424,627
RAC: 0
Message 49373 - Posted: 17 Jun 2014, 7:01:41 UTC - in response to Message 49371.  

Things seem to be working right now. Again, unless I am way off base in interpreting what I am observing on my machine, I noted that the Disk tab's Disk Usage pie-chart did not drop. Rather, it increased to ~77 GB - about a 2+ GB increase! I'll wait a bit longer to see what is going on and then try deleting files that do not appear in the Task listing.

Thanks for the help!


MarkJ
Joined: 28 Mar 09
Posts: 126
Credit: 9,825,980
RAC: 0
Message 49387 - Posted: 19 Jun 2014, 13:59:55 UTC
Last modified: 19 Jun 2014, 14:03:18 UTC

An easier way to clean up the project directory would be to delete the CPDN project and then add it back. For BOINC v6 that would be remove and then add the project. That is, go to the Projects tab in BOINC Manager, select CPDN and then click on the delete (or remove) button. To add it back, click on the Tools menu -> Add project and then select CPDN from the list.

That should clean out your project folder and is much easier than trying to work out what's needed and what is not.
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 49388 - Posted: 19 Jun 2014, 22:30:47 UTC

If cpdn is working for you, and you have lots of disk - prudent not to mess with it.

But 77 Gibibytes is way way too much.

Either manually remove the aged cruft - or wait until nothing is running and reset the project.

When convenient for you.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49389 - Posted: 20 Jun 2014, 5:26:18 UTC

Michael

A project Reset is less of a problem than a project Disconnect.

Computer 926174 is the one with all of the crashed models.
This crashing leaves behind small amounts of files, which can quickly add up to a lot of Gigs if you don't regularly clean them out. Which is EASY to do manually.
Also, a manual clean up will show you just how many you're crashing, and how much is getting left behind.
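
To see how much is being left behind, the sizes of the leftover directories can be totalled with a short script. This is a minimal sketch using only the Python standard library; the function names are illustrative, and which directories to pass in is left to the reader.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symbolic links."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def report_sizes(dirs):
    """Return (directory, size in bytes) pairs, largest first."""
    sizes = {str(d): directory_size_bytes(d) for d in dirs}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)
```

Running report_sizes on the directories that no longer correspond to a listed task gives a rough picture of how much space each crashed model has left behind.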

As for the size INCREASING a couple of Gigs, I'm not surprised, if you left the computer to download more, and then crashed some of them too.
8 processors can quickly destroy a lot of data sets.

And perhaps the reason for the crashing is that you've left the option 'Suspend work if CPU usage is above' at the default of 25%.
This may be OK for other projects, but here it can be fatal for climate models, which DON'T like being constantly interrupted.
MichaelO

Joined: 8 Aug 05
Posts: 12
Credit: 24,424,627
RAC: 0
Message 49390 - Posted: 20 Jun 2014, 6:30:56 UTC - in response to Message 49388.  
Last modified: 20 Jun 2014, 6:31:57 UTC

Thanks all for the assistance!

Les, you are correct about Suspend Work... - I thought I had followed geophi's list of recommended settings. Obviously, I didn't double-check.

As for your advice, Eirik, I will plan a time to clean things up, because I am up to 80+GB as of today - 19 Jun (PST).

Sorry to have been slow to ask questions about the disk usage before I wasted all those millions of CPU seconds around 14 Jun 2014! But, now I understand what I have been doing wrong. Still, it makes me sad to realize this large loss of work was mostly preventable...

Best regards to all!


©2024 climateprediction.net