climateprediction.net home page
VANISHING WU'S

Profile astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 48058 - Posted: 26 Jan 2014, 16:18:42 UTC - in response to Message 48056.  

The primary reason is the server upgrade. That's no small task, given all the unique code added for this project, which uses the server somewhat differently from other projects.

My machines heat my house, so the electricity does double duty (otherwise the electric radiant ceiling heat would have to be used). CPDN doesn't benefit now, but Einstein and WCG get a small boost.

What's the Russian general's line in "War and Peace"? Patience and time... time and patience.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Ingleside

Joined: 5 Aug 04
Posts: 108
Credit: 19,072,610
RAC: 36,507
Message 48061 - Posted: 27 Jan 2014, 17:13:03 UTC - in response to Message 48023.  

Every night at midnight your time you'll still have your work quota put back to one model per core per day.

Um, in older BOINC server code it was midnight server time, not user time, so for CPDN this would be midnight GMT in winter.

Since having all quota-limited computers connect in the hour after server midnight caused an extra spike in server load, in more recent server code the "midnight" is instead randomly assigned to each computer. This means someone with multiple computers can have one computer getting a new quota at 01:23:45, another at 12:33:44, a third at 05:43:21 and so on. I'm not sure whether CPDN has recent-enough code to have this functionality or still runs the older midnight-server-time code...
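To make the difference between the two schemes concrete, here is a rough Python sketch. It assumes, purely hypothetically, that each computer's reset point is a fixed pseudo-random offset after server midnight derived from its host ID. This is not BOINC's actual scheduler code; the function names and the hashing scheme are made up for illustration. With `randomized=False` every host resets at server midnight (the older behaviour); with `randomized=True` each host gets its own time of day.

```python
import hashlib
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def reset_offset_seconds(host_id: int) -> int:
    """Hypothetical: a fixed pseudo-random offset (0..86399 s) after server midnight for one host."""
    digest = hashlib.sha256(str(host_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % SECONDS_PER_DAY


def last_reset(host_id: int, now: datetime, randomized: bool = True) -> datetime:
    """Most recent quota reset at or before `now`, measured in server time."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    offset = reset_offset_seconds(host_id) if randomized else 0
    reset = midnight + timedelta(seconds=offset)
    if reset > now:
        # Today's reset point hasn't arrived yet, so the last one was yesterday.
        reset -= timedelta(days=1)
    return reset


def quota_restored(host_id: int, last_exhausted: datetime, now: datetime,
                   randomized: bool = True) -> bool:
    """True if a reset has happened since the host used up its daily quota."""
    return last_exhausted < last_reset(host_id, now, randomized)


if __name__ == "__main__":
    now = datetime(2014, 1, 27, 17, 13, tzinfo=timezone.utc)
    for host in (1001, 1002, 1003):
        # Each hypothetical host gets a different, but stable, reset time of day.
        print(host, timedelta(seconds=reset_offset_seconds(host)), last_reset(host, now).isoformat())
```

Hashing the host ID (rather than drawing a fresh random number each day) keeps each computer's reset time stable, which is one plausible way to spread load; the real server may do it differently.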



Profile mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 48063 - Posted: 27 Jan 2014, 18:06:33 UTC

You are right; it's midnight servertime, not usertime. Sorry for the mistake.

I believe that after the server's midnight, computers are allowed to request new work at a random number of minutes during the next hour. I'm not sure how this affects CPDN as our computers can only request work once per hour anyway. I imagine it means the work quota is reset at the computer's first server contact after server midnight.
Cpdn news
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4342
Credit: 16,502,075
RAC: 5,584
Message 48071 - Posted: 30 Jan 2014, 16:17:18 UTC

Not sure if this is valid or not:
Name hadcm3n_7jcv_1980_40_008436370_3
Workunit 8587226
It doesn't seem to be marked "No Resubmission", but another one in that workunit is, and it does have the 2023 deadline. Mind you, it will be a few days till one of the other tasks I am crunching finishes anyway.
Profile JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,063,325
RAC: 928
Message 48074 - Posted: 31 Jan 2014, 1:54:46 UTC

I hate to be the bearer of bad tidings, but the 2023 deadline is a giveaway. It is almost certainly a bad WU and I would abort it.

Professor Desty Nova

Joined: 19 Sep 04
Posts: 92
Credit: 1,935,344
RAC: 381
Message 48075 - Posted: 31 Jan 2014, 10:34:59 UTC

And having "No Resubmission" at the top of the workunit info is also not a good sign (http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8587226).




Professor Desty Nova
Researching Karma the Hard Way
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48077 - Posted: 31 Jan 2014, 21:08:52 UTC - in response to Message 48076.  

The project is in the final stages of the major upgrade. Still a fair bit to do, but it's getting there.
It wasn't helped by the recent major failure of part of the university computer network where our servers are located.


Backups: Here
Profile JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,063,325
RAC: 928
Message 48078 - Posted: 31 Jan 2014, 22:45:34 UTC

Judging from the count in "Tasks in Progress" on the "Server Status" page, I think that we may have had a small release of new WUs overnight. Probably only 2000 or 3000. Hopefully, these are good ones. With all the hungry computers out there they didn't last long. Unfortunately, I didn't get one.

Profile Alan K

Joined: 22 Feb 06
Posts: 484
Credit: 29,602,471
RAC: 2,231
Message 48137 - Posted: 10 Feb 2014, 16:02:33 UTC

I have recently picked up hadcm3n_7x8g_1980_40_008454355, but it has a completion date of 1st May 2014 rather than 2023. Is this one of the rogue batch to be aborted, or should I let it run anyway?
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4342
Credit: 16,502,075
RAC: 5,584
Message 48138 - Posted: 10 Feb 2014, 16:09:01 UTC - in response to Message 48137.  

A completion date of May 2014 suggests it's not one of the rogue batch. Also, if you look at the work unit link, http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_reply.php?thread=7671&post=48137&no_quote=1#input it isn't marked "No Resubmission", so you should be all right on this one.
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,488,748
RAC: 4,577
Message 48139 - Posted: 10 Feb 2014, 16:09:45 UTC - in response to Message 48137.  

I have recently picked up hadcm3n_7x8g_1980_40_008454355, but it has a completion date of 1st May 2014 rather than 2023. Is this one of the rogue batch to be aborted, or should I let it run anyway?

No, the first task from that work unit was issued in September, so it is not one of the bad batch.
Profile Alan K

Joined: 22 Feb 06
Posts: 484
Credit: 29,602,471
RAC: 2,231
Message 48148 - Posted: 11 Feb 2014, 9:51:51 UTC - in response to Message 48139.  

Thanks. Just checking.
Barblovesroses

Joined: 20 May 10
Posts: 13
Credit: 55,033
RAC: 0
Message 48152 - Posted: 11 Feb 2014, 19:09:38 UTC - in response to Message 48148.  
Last modified: 11 Feb 2014, 19:14:27 UTC

This has been a very informative thread for me, and it makes me wish I had started reading it much earlier. I have aborted a lot of my earlier jobs because they still had many hours left to run when the due date came, and I probably should have checked with someone about whether to let them keep running.

I am approaching a deadline on a job now. The deadline is the 13th @ 3:38 am and the task will not complete before then; it's physically impossible to run 180+ hours in less than 48. The task is had3mcn_022u_2020_40_008398650_3. Another thing about the task is that its remaining time keeps growing: a few days ago it had 172 hours left to run and now it has over 180, so I really don't know how long it will take to finish.

So, do I let the job keep running, or abort it like I have done with almost every previous job for this project?

It seems such a shame to run a task for so many hours and then abort it because time has expired...
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48153 - Posted: 11 Feb 2014, 20:08:39 UTC - in response to Message 48152.  

There's no deadline. The number that's reported as one, is just there because the BOINC software requires one.

For this project, a computer can take a couple of years if necessary. The only problem then is that the model in question will get a red message in the "status" field on its page on the server, and the server software will re-issue it to someone else.

The gradual increase in the completion time would probably be caused by the computer running it not running sufficient hours a day, or being very slow, or the model being swapped out for work from other projects. (Which is effectively the same as the first reason.)

Just keep plodding on.


Backups: Here
Barblovesroses

Joined: 20 May 10
Posts: 13
Credit: 55,033
RAC: 0
Message 48157 - Posted: 13 Feb 2014, 10:53:10 UTC - in response to Message 48153.  

Thanks for the info Les.

I don't know about your explanations for the added hours. The project has been running continuously on high priority status and hasn't stopped except for when I rebooted my computer a couple times earlier today. I had just over 180 hours yesterday and now the project has 194 hours remaining to run so I don't understand how it can add about 14 hours in one day's time under the explanation you gave.

It just doesn't sound right to me.

Anyway, I am continuing to let it run. If it continues to increase hours at this same rate I will report additional hour increases as something strange seems to be occurring here.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48158 - Posted: 13 Feb 2014, 11:26:36 UTC - in response to Message 48157.  

Well, there is another possibility: the model is in an indefinite loop.

You'll need to look at the numbers in the bottom left corner of the Show graphics page, and watch them or write them down. Check every now and then to see if they go back to earlier numbers and then repeat.
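Here is a minimal Python sketch of that check, assuming you copy the timestep counter from the graphics page by hand every so often. The function name `find_regressions` and the reading format are hypothetical; nothing in the CPDN client provides them. A counter that drops below an earlier reading is the sign of the loop described above. The sample values are the three readings posted later in this thread.

```python
from typing import List, Tuple

# One hand-recorded reading: (when it was taken, timestep shown on the graphics page).
Reading = Tuple[str, int]


def find_regressions(readings: List[Reading]) -> List[Tuple[str, int, int]]:
    """Return (when, previous_step, step) for every reading whose timestep went backwards."""
    regressions = []
    for (_, prev_step), (when, step) in zip(readings, readings[1:]):
        if step < prev_step:
            regressions.append((when, prev_step, step))
    return regressions


if __name__ == "__main__":
    readings = [
        ("2/13 noon", 582301),
        ("2/13 7:38 PM", 590551),
        ("2/14 6:20 am", 601153),
    ]
    problems = find_regressions(readings)
    if problems:
        for when, before, after in problems:
            print(f"{when}: timestep fell from {before} to {after}; possible loop")
    else:
        print("Timestep is still advancing")
```

Only consecutive readings are compared, so a loop shows up once a later reading lands below an earlier one; taking readings over a day or two gives the best chance of catching it.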



Backups: Here
Barblovesroses

Joined: 20 May 10
Posts: 13
Credit: 55,033
RAC: 0
Message 48168 - Posted: 14 Feb 2014, 11:35:11 UTC - in response to Message 48158.  

OK, here are 3 readings that I've taken from the screen as you suggested:

2/13 noon
582,301 of 1,039,392 (56.03%)
Hours of computing: 469.27.28
_______
2/13 7:38 PM
590,551 (56.82%)
Hours of computing: 475.38.04
_______
2/14 6:20 am
601,153 (57.84%)
Hours of computing: 463.22.58

It does appear to me to be progressing in sequence within the task itself; however, it's still adding time onto the job, so maybe this is normal and the 14 hours yesterday was a fluke. The job now has 197 hours remaining to completion and 811 completed, so it has added another 3 hours onto the task since my last message to you... but that's fewer than the 14 from the previous day!

I'll keep watching and see what happens from here unless you have any other thoughts.

Profile JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,063,325
RAC: 928
Message 48170 - Posted: 14 Feb 2014, 22:47:09 UTC
Last modified: 14 Feb 2014, 22:51:50 UTC

Keep running the WU (24/7 if possible) and keep watching the model dates. It should be constantly progressing. What you are looking for is if it suddenly regresses by several years. This would indicate that it was stuck in a loop. It can go round and round in these loops forever. This happens to these models sometimes and there is nothing that you can do to fix it.

Good Luck.

P.S. this is one reason that they have to fix the graphics on the new 7.22 Hadam3p_pnw models. Without visible dates it is extremely hard to know if a model is progressing normally.
Profile JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,063,325
RAC: 928
Message 48213 - Posted: 22 Feb 2014, 14:39:56 UTC

Tasks in progress is down to 38,331. That is the lowest that I have ever seen it. In 3 hours I will have an empty core to fill.
Hopefully more are on the way soon.

old_user608497

Joined: 31 Dec 09
Posts: 12
Credit: 17,214
RAC: 0
Message 48214 - Posted: 22 Feb 2014, 15:06:57 UTC

Hi there!

Finally I've got hold of a CPDN work unit (hadcm3n_7zue_1980_40_008457737), but having read this thread I'm a bit concerned now. My WU is one of the hadcm3n_7 series, and there seem to be problems with those. However, it has a deadline of May 2014, it isn't tagged with "No Resubmission", and it says it was originally submitted in September 2013. Am I right to assume that it's OK to run? I don't want to waste any computer time and energy.


©2024 climateprediction.net