| Author | Message |
|
|
|
Getting the following:
8/31/2012 12:31:24 AM | climateprediction.net | Started upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip
8/31/2012 12:31:26 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip: transient HTTP error
8/31/2012 12:31:26 AM | climateprediction.net | Backing off 3 min 9 sec on upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip
8/31/2012 12:33:33 AM | climateprediction.net | Started upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip
8/31/2012 12:33:35 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip: transient HTTP error
8/31/2012 12:33:35 AM | climateprediction.net | Backing off 6 min 7 sec on upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip
8/31/2012 12:33:40 AM | climateprediction.net | Started upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip
8/31/2012 12:33:41 AM | climateprediction.net | Temporarily failed upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip: transient HTTP error
8/31/2012 12:33:41 AM | climateprediction.net | Backing off 8 min 25 sec on upload of hadam3p_eu_97xa_1969_1_008158384_0_1.zip
Checking the server status page, one of the upload servers shows as not running; the other 2 upload servers are up. Things are trickling, but data can't upload....
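For anyone wondering why the "Backing off" times in the log above keep growing and look random: the client appears to use some form of randomized exponential backoff. The snippet below is only a rough sketch of that general technique, not the BOINC client's actual code. The function name, the 60-second base and the 4-hour cap are all made up for illustration.

```python
import random

def backoff_delay(attempt, base=60.0, cap=4 * 3600.0, rng=random):
    """Return a randomized retry delay in seconds for a 1-based attempt number.

    base and cap are illustrative values, not taken from the BOINC client.
    """
    # The upper bound doubles with each failed attempt but never exceeds the cap.
    upper = min(cap, base * (2 ** attempt))
    # Pick a random point in the upper half of the window so that retries
    # from many clients do not all hit the server at the same moment.
    return rng.uniform(upper / 2, upper)

if __name__ == "__main__":
    for attempt in range(1, 6):
        print(f"attempt {attempt}: wait {backoff_delay(attempt):.0f} s")
```

The random spread matters more than the exact numbers: with thousands of machines retrying, a fixed delay would have them all hit the upload server together.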
____________
 |
|
|
|
|
|
Getting similar errors but not many waiting uploads so far.
Server status page shows "uploader1.atm" as down.
Staff probably aware already - it's already after 9AM in the prime time zone.
____________
|
|
|
|
|
|
Information from staff:
The hard disk running the operating system on uploader1.atm has failed and needs to be replaced. We have ordered a new disk which will arrive on Monday and be installed on that day. So at the moment this machine is shut down and won't be up-and-running until Monday, I am afraid.
That will affect, at least, the intermediate (_1 to _12) file uploads for EU regional models, possibly others too. |
|
|
|
|
|
"The hard disk running the Operating System" WTF?
This is one of the looniest postings I've ever seen here.
Any serious server installation has at least a mirror of the OS for backup or alternative boot and OS on whatever of several physical drives -- whether IBM mainframe or my local mini-cluster or the cloud we are all expected to trust, or a lousy backup boot partition on Linux.
"The hard disk running the OS" what could that hard disk possibly be?
Are we trusting all this compute power to the power of the "C: drive"
And how would replacing the bare disk fix the loss of the OS --
Sorry for the rant, but the explanation makes no sense whatsoever at all - and makes the support team there look like total idiots - which I know they are not.
Yes - the complete explanation would cover a lot of techie stuff that would bore most of us to tears -- but the nonsensical explanation posted is so dumb.
Me -- sometimes the project has problems - as far as I can see the problems get fixed within a week -- no data ever lost. Last 6 years or so. I keep on contributing -- no regrets.
But "need an OS disk to keep running" - Sorry about that but is so idiotic -- could have been a totally uninformed politician posting that.
Please don't BS us who contribute.
Maybe - "the team waits for hardware to fix the problem"
might be plausible --
"Need an OS disk" obviously makes fools of us all.
In any case- keep on crunching - the crew have done wonders - and keep on doing so --
But - nonsensical pretend explanations of problems are losers in the long run.
____________
|
|
|
|
|
|
Assuming a RAID system, if one of the hard disks had failed, it might well shut down as a precaution. If the second disk in a 2-disk RAID system also went, that might cause data loss, so they would be awaiting a new disk to rebuild the array.
I have never used RAID, just been rather paranoid about backing up important stuff, so this is purely based on my reading, not experience lol.
Dave |
|
|
|
|
|
Now the South African download server is down, why doesn't that surprise me? The techs at Oxford couldn't care less about this project. The whole world's watching them, I hope they never put it on their résumé. |
|
|
|
|
|
Actually, I believe that the techs on this project are doing a very good job.
The limited funding for the research puts them in a position where they can't have what most of us "techies" just assume is normal. They have to do the best they can with what they've got, and that's not a lot.
Mirrored drives for the OS - we see that's not true. Spare disc drives just lying around or online already waiting for a problem - obviously not so. Redundant SAN with no SPOF anywhere and automatic failover to a backup system - at least a year or two worth of storage waiting on-line already? Don't think so.
Maintenance contract with a big database company that will fix any problems in 24 hours, provided that you have enough spare backup hardware pre-certified?
Heh- all that could be fixed with less than 25 million euros - rough guess. Maybe 50. (not counting the service contracts with the vendors)
The tech support at the project are supporting - not only the hardware - but more important and invisible to us volunteers - they are supporting the access to the work we have done - the database - for researchers worldwide.
Understaffed, overworked, with more job demands than anything I ever did as a techie. (Hardware, software, database, application expertise - that would be at least 8 FTEs at even the cheapest shop I ever worked in)
My earlier rant about the ongoing problems with servers should be interpreted as me venting my frustration with the whole situation -
NOT as an accusation of the understaffed and underfunded crew.
____________
|
|
|
|
|
|
Totally agree Erik! Two Techies there to do the job. If they had your estimate of eight and they were the same quality as those they have and those eight had the money to buy the hardware they wanted ........ I don't think we would see many of the problems we do....... Or maybe they would just try and do 4 times as much, succeed and still get as many complaints?
Dave |
|
|
|
|
|
Uploads are working slowly - expect will catch up next 3-4 hours.
Thx Dave - yeah volunteer here a few years the temporary failures of hardware are annoying but no big deal - wait a few days or week at worst and all the work gets uploaded and distributed eventually. Nothing ever lost.
It once happened that a misconfig and a load of crap WUs got my goat by wasting my limited bandwidth, but that was a while ago.
Main point is - most contributors never notice a week's downtime on the upload server. Last time I looked the "top -- whatever" - computers - they were wasting wu's a mile a minute -
So - thanks - let's keep the osmolality of the effluent minimal when we post here, and keep on crunching -- it's worth doing. Apologize for any flaming I've done.
And - to all - complain, bitch and worry -- if there's ever a problem -- it might be an old moldy problem - but it might be a new problem - and reporting such a problem might very well save all of us volunteers a lot of wasted effort -
So - If you read this board - all complaints are welcome !! :):) - the Mods welcome the chance to help all problems !! :):
Actually, they do help a lot -- thanks
PS - I am not MOD, never will be, but thanks to them all
____________
|
|
|
|
|
|
7 Sept 2012, 05:36 UTC;
upload disk full error message started to appear 6 Sept 2012 at 22:23 UTC
Server status page indicates server is up and running
Just thought you would like to know.
____________
|
|
|
|
|
|
Thanks. Confirming what you reported. Same here.
____________
|
|
|
|
|
|
I am getting the same on an EU model. The SAF model, which goes to a different server, is fine. They should be starting work about now in Oxford, so I assume we will see some action this morning.
Dave |
|
|
|
|
|
Confirming that I am also getting upload failures repeatedly. In itself that does not worry me, but it does chew through my upload quota at a great rate. Is there any way to disable the upload for a while? (This is a completed task, and I have other tasks running, so I do not want to just disable network traffic.)
____________
|
|
|
|
|
Confirming that I am also getting upload failures repeatedly. In itself that does not worry me, but it does chew through my upload quota at a great rate. Is there any way to disable the upload for a while? (This is a completed task, and I have other tasks running, so I do not want to just disable network traffic.)
You could "disable network activity" on one of the tabs in the manager --
BUT -- seems that uploads are working again, so try that option later.
OH gorgonzola and other cheeses -- so overwhelmed with backlog uploads now -- just wait a few hours.
____________
|
|
|
|
|
|
Just to confirm that an eu zip file went through at 10:54 on one machine and two more have gone through since so issue seems resolved apart from my curiosity - in the past when the disk has filled up it has taken several hours to transfer the data before the disk has come back on line again. Seems suspiciously quick for it to have really filled up.
Dave |
|
|
|
|
|
Could redirecting the URL for the upload handler in the hosts file to, say, 127.0.0.0 be an option? |
|
|
|
|
|
Problems with uploader1, both up and down. Friday, of course.
____________
|
|
|
|
|
Problems with uploader1, both up and down. Friday, of course.
I let the project people know, but like you say it's Friday. Hopefully it'll get fixed early next week. |
|
|
|
|
|
My three waiting uploads have all gone, however the server keeps going back to red every so often on the server status page.
Dave. |
|
|
|
|
|
Yup - the server goes on and off. Has uploaded a few dozen files from here.
All I worry about is whether the uploads get lost - however many days it takes to get the job done is not a problem. Losing data is the possible problem - but that has never happened as far as I know - long delays happen when the server is catching up.
I run 6 machines - right now 3 have network disabled - the other 3 are uploading slowly from time to time. Won't enable network for the other 3 until the online ones clear their queues. Might be a while.
The important thing is not to lose the uploads. Patience is a virtue.
____________
|
|
|
|
|
|
Jonathan only cleared 750 Gigs of space over the weekend. He's currently looking for a cupboard with some spare shelf space to store some more. Data is stacked up everywhere. Probably have to buy some buckets for it. :)
____________
Backups: Here |
|
|
|
|
|
Buckets that size don't come cheap. |
|
|
|
|
|
At SETI@home volunteers are donating dozens of 1TB and 2TB disks to store data.
Tullio
____________
|
|
|
|
|
|
At SETI@home volunteers are donating dozens of 1TB and 2TB disks to store data.
I don’t know if this would work with this project. How do you guard against data loss?
When you say that they “donate” I assume that the drives remain in the homes of the donors. Home, non-commercial quality HDs are not known for their overwhelming reliability. I had a 2TB external backup drive fail only a few months ago. One moment it worked, a few hours later it didn't. No warning. Also, what happens if a person who has project data just suddenly stops participating?
One thing that can be said for CP is that despite all our server problems we have NEVER LOST DATA!
____________
|
|
|
|
|
|
No, by donating Tullio means buying them and sending them on to Berkeley. |
|
|
|
|
|
All servers at Oxford, and there are many different departments, with server rooms and IT sections, would most likely be under a service contract.
And crunchers don't need to know all the 'behind the scenes' details and plans.
____________
Backups: Here |
|
|
|
|
All servers at Oxford, and there are many different departments, with server rooms and IT sections, would most likely be under a service contract.
And crunchers don't need to know all the 'behind the scenes' details and plans.
Also, there are various upload (and download) servers worldwide both for current wu and for the database of completed results.
So it's not just "distributed computing" - it's "distributed database"
Like JIM posted - no uploaded results lost in 8 years.
It's possible to build fairly reliable systems from consumer-grade discs - but takes a lot of planning, design and maintenance. Donating cheap hardware to the project might help, probably not - don't know what SETI is doing. There's lots of information on the web on how to do it - but the devil is in the details. And the work-hours of maintaining such a thing is -- done that- don't want to again - retired.
Like Les said -- I don't want to know the details -- because I've been there - and second-guessing future storage improvements and estimated total costs and all is a total brain-bender and management always complains anyhow no matter how hard you work to design and build a thing that will be obsolete before the Board of Directors signs off.
Thanks to the crew for keeping things going mostly, and for not losing any uploaded data.
____________
|
|
|
|
|
|
There is a donate button http://climateprediction.net/content/donations on the main project page that could probably do with more publicity. I suspect that more people working for the project might be a higher priority than extra hard drives but as Les says, us crunchers don't need to know all the details and if we did we would probably be overwhelmed!
Dave |
|
|
|
|
There is a donate button http://climateprediction.net/content/donations on the main project page that could probably do with more publicity. I suspect that more people working for the project might be a higher priority than extra hard drives but as Les says, us crunchers don't need to know all the details and if we did we would probably be overwhelmed!
Dave
Overwhelmed - no way
If the project could pay me a lousy USD 120000 per year and give me another few million for hardware I could fix all their problems (add a few consultants on the database side) I'd even come out of retirement!
Might try the "donate" button
____________
|
|
|
|
|
|
Totally understand!
Dave |
|
|
|
|
No, by donating Tullio means buying them and sending them on to Berkeley.
Correct. It is the GPU User Group, that is those using graphic cards to accelerate their processing, that sponsors donations, orders disks and also servers, and sends them to the Space Sciences Laboratory. I think it is the only BOINC project where this happens.
Tullio
____________
|
|
|
|
|
No, by donating Tullio means buying them and sending them on to Berkeley.
Correct. It is the GPU User Group, that is those using graphic cards to accelerate their processing, that sponsors donations, orders disks and also servers, and sends them to the Space Sciences Laboratory. I think it is the only BOINC project where this happens.
Tullio
Got any more info - or link? sounds possibly useful.
____________
|
|
|
|
|
No, by donating Tullio means buying them and sending them on to Berkeley.
Correct. It is the GPU User Group, that is those using graphic cards to accelerate their processing, that sponsors donations, orders disks and also servers, and sends them to the Space Sciences Laboratory. I think it is the only BOINC project where this happens.
Tullio
Got any more info - or link? sounds possibly useful.
GPU User Group - www.gpuug.org
Users can donate towards a specific purpose, or they can buy a drive, or even donate directly to the project. The last one is done via Paypal and at the end of the month the project gets a payment less Paypal fees. UC Berkeley aren't allowed a Paypal account themselves.
I did ask Jonathan if he could tell me how much a 2TB drive costs in the UK so I could work out a suitable donation, but he hasn't provided me with any information (probably rather busy, I expect). If he or someone else could tell us that and how many drives they need, we could work towards that goal. The idea is to have smallish goals for specific items, something achievable.
____________
BOINC blog |
|
|
|
|
|
It's probably university policy to not discuss money matters with people not working for the uni. A commercial-in-confidence type of thing.
And a hard disk isn't of much use without a server to run it.
And servers need rack space, and power.
____________
Backups: Here |
|
|
|
|
It's probably university policy to not discuss money matters with people not working for the uni. A commercial-in-confidence type of thing.
And a hard disk isn't of much use without a server to run it.
And servers need rack space, and power.
All very true. Servers paddym and georgem were also donated. Rack space and power were not. But the SETI@home devs/admins provided them. They are also volunteers for SETI@home besides doing work for UC Berkeley.
Tullio
____________
|
|
|
|
|
It's probably university policy to not discuss money matters with people not working for the uni. A commercial-in-confidence type of thing.
And a hard disk isn't of much use without a server to run it.
And servers need rack space, and power.
I was assuming they would be replacing existing drives with something that would be newer (and theoretically more reliable) as well as possibly giving them more space, depending on what size drives they are replacing. A rough idea of how much drives cost (recommended retail price) and how many they want/need to replace would have been helpful. If that is too much information then how can they expect us to help?
____________
BOINC blog |
|
|
|
|
It's probably university policy to not discuss money matters with people not working for the uni. A commercial-in-confidence type of thing.
And a hard disk isn't of much use without a server to run it.
And servers need rack space, and power.
I was assuming they would be replacing existing drives with something that would be newer (and theoretically more reliable) as well as possibly giving them more space, depending on what size drives they are replacing. A rough idea of how much drives cost (recommended retail price) and how many they want/need to replace would have been helpful. If that is too much information then how can they expect us to help?
Searching the web -- SAS drives in the 300 GB capacity range at 15k rpm are running a few hundred dollars each - a bit less if you buy case lots. 600 GB drives in this speed and reliability range are a bit more expensive per TB.
Consumer grade 1-2 TB drives are cheaper by far per TB but need much more expertise and connectivity and replication to make them competitive for enterprise reliability needs. And less than half the read speed and even less seek speed. So you need a database analyst and some serious testing to compare the multi-redundant SAS to the even-more-redundant cheap disks you would need to mimic the speed and reliability of the "server-grade" disks.
What I'm saying is -- speed, reliability, redundancy -- takes a lot of work to figure what's best for any particular application.
Not to mention connectivity - you want dual-port SAS drives so when one network server fails the backup system works -- another couple hundred per drive - and consumer-grade 2 TB drives don't even offer this option.
It's not about replacing a few drives with newer cheaper ones.
It's about building a reliable replacement system or 2 or 3
____________
|
|
|
|
|
|
What you say is true for data that must remain readily accessible, Eirik. But it seems to me (from the outside) that CPDN's main requirement is for somewhere to put data that no-one has wanted during the last few months, and that is unlikely to be wanted for the next few months or years -- but it might be wanted sometime. Most likely, when a scientist does want it, they'll be able to give plenty of notice.
Back in the day, IBM used to sell the concept of tiered storage: on-line, near-line and off-line. The idea was that 'hot' data would stay on the on-line storage, and when people stopped accessing it it would migrate to progressively less responsive (but cheaper) storage.
Of course IBM sold fancy systems to 'migrate' unneeded data automatically. But I don't think CPDN needs that. It does need some kind of systematic archiving process, though.
I'd caution that archiving is an ongoing process, not a one-time event, and resources should be allocated and processes set up accordingly.
For non-critical data such as CPDN run results, two copies on consumer-grade storage, kept in separate file store-rooms in separate buildings in separate campuses and tested annually, should provide enough of a guarantee of future accessibility.
100 TB of non-critical offline storage is then some checksum files, a hard-back book, a label maker, 100+ 2TB disks and a USB3 dock, and two cupboards -- plus a high-school student volunteer for a few weeks each year (to stock-take and checksum the archives, replace any failed disks and archive new data). And the instructions for the student. |
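To make the "checksum files" part of that concrete, here is a minimal sketch of what the yearly check could look like. It is purely illustrative and says nothing about how CPDN actually archives data. The function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify(root, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

The manifest would be built once when a disk is written and stored separately from it. The annual check is then just running verify() against each disk and restoring anything it reports from the second copy.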
|
|
|
|
|
The data is permanently on line, and can be accessed via this page.
Each model completed, is also linked to the results pages, via a line at the bottom of each model's page.
It needs to also be remembered that there's no 'cpdn section' at Oxford uni.
The research is a research project of the Atmospheric, Oceanic and Planetary Physics department.
The 2 "programmers" are IT specialists / programmers who, with others, work for the Oxford e-Research Centre.
The Oxford e-Research Centre works with research units across the whole of Oxford University to enable the use and development of innovative computational and information technology in multidisciplinary collaborations.
As such, it can be assumed that they know about many things, including large databases and various storage schemes. In fact, I can vaguely recall reading the job specs for one of these positions a few years ago, which talked about these same things as part of the job requirements.
____________
Backups: Here |
|
|
|
|
|
I wonder if the BOINC volunteer storage, if they ever get it completed, would be useful here. I would post a link but the Akismet anti-spam is so paranoid I can't link to it. Suffice to say it's at:
boinc dot berkeley dot edu slash trac slash wiki slash VolunteerStorage#
____________
BOINC blog |
|
|
|
|
|
I presume the server filled up again over the weekend? |
|
|
|
|
|
Yes. Message in the News thread a couple of days ago.
____________
Backups: Here |
|
|
|
|
|
Thanks, sorry, not paying attention! Normally I spot the news posts.
Dave |
|
|
|
|
|
Jonathan has been working on it.
But the biggest server for moving the data to has a disk problem now. And it takes a long time to 'chunk' the data for moving, verify that it's OK after the move, and then re-link each model to the research area.
There's terabytes to move, the university net isn't particularly fast, there was a network failure, and most of the IT people from all over Oxford took off for 'more interesting places' as soon as Long Vacation started.
____________
Backups: Here |
|
|
|
|
|
Les, I understand -- happens often enough over here - as soon as I spotted the upload problem, I suspended my Climate apps and let other applications cycle along. I long ago learned that one should have two or three applications running on a workstation for each of the CPU apps and the GPU apps.
|
|
|
|
|
|
thanks Les,
is there anything we Crunchers should do with our BOINC client ?
I wish Jonathan well.
For those who missed Jonathan's post:
<quote>
We suffered a brief network outage today, which prevented connections to or from various CPDN servers.
The fault developed at approximately 2 pm BST and continued for two hours.
The hardware responsible is due to be replaced imminently, but the project is 'at risk' until that has been done (probably for another 12 hours).
Jonathan Miller
CPDN SysAdmin
</quote>
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=5447&nowrap=true#44898
25/09/2012 12:51:19 PM | climateprediction.net | Started upload of hadam3p_eu_wi36_1978_1_007215635_2_9.zip
25/09/2012 12:51:22 PM | climateprediction.net | [error] Error reported by file upload server: can't open file
25/09/2012 12:51:22 PM | climateprediction.net | Temporarily failed upload of hadam3p_eu_wi36_1978_1_007215635_2_9.zip: transient upload error
25/09/2012 12:51:22 PM | climateprediction.net | Backing off 3 min 1 sec on upload of hadam3p_eu_wi36_1978_1_007215635_2_9.zip
25/09/2012 12:54:18 PM | climateprediction.net | Started upload of hadam3p_eu_wkar_1963_1_007216497_1_9.zip
25/09/2012 12:54:20 PM | climateprediction.net | [error] Error reported by file upload server: can't open file |
|
|
|
|
is there anything we Crunchers should do with our BOINC client ?
Set the project to No new tasks to stop polling for work
Suspend all climate models so as not to add more zips that won't upload.
Set Network activity suspended if possible to completely stop talking to the project.
____________
Backups: Here |
|
|
|
|
|
This is no fun any more. There are constantly problems with the servers.
The team also still hasn't managed to provide a GPU application.
Climate was once my favorite project.
Luckily there are still other scientific projects.
____________
Wiki German Language, Wiki in deutscher Sprache
|
|
|
Jonathan Miller (Forum moderator, Project administrator, Project developer, Volunteer developer) Joined: Mar 28 11, Posts: 24, Credit: 82,588, RAC: 0
|
|
Hi,
We have issues on all three of our storage servers at the moment.
Currently Uploader1.atm is full, and the two machines who would normally receive her excess files are suffering from disk issues.
cpdn-upload2.oerc is one of the machines above, so she cannot currently receive uploads.
We are waiting on a fix - I suspect it is to do with the network outage that OeRC suffered yesterday afternoon (2 - 4 pm BST, 25 Sept 2012).
|
|
|
|
|
|
thanks for the update Jonathan. Best Wishes Byron.
|
|
|
|
|
|
Set the project to No new tasks to stop polling for work.
Done
Set Network activity suspended if possible to completely stop talking to the project.
Done
Suspend all climate models so as not to add more zips that won't upload.
I'm not sure on how I do this ?
Could you please provide details on how I do this ?
|
|
|
|
|
|
One way would be:
In BOINC Manager, select Projects tab
Select climateprediction.net project
Click Suspend button |
|
|
|
|
Suspend all climate models so as not to add more zips that won't upload.
I'm not sure on how I do this ?
Could you please provide details on how I do this ?
If you're content to turn off network activity, as you have already done, then there is no need to suspend the models themselves, since the Zip files will simply accumulate until network activity is turned on again. Accumulation of Zip files is not normally a problem, it's 1000's of machines trying and failing to upload them to the affected server that's the problem.
If, however, you didn't want to turn network activity off because, for example, you are running other projects, then it might be a good idea to suspend the CPDN models in order to stop more Zips being generated and failing to upload. To do that, just select the model in the BOINC Manager 'Tasks' tab and press the 'Suspend' button; or select climateprediction.net in the 'Projects' tab and press the 'Suspend' button. The latter option will stop all CPDN tasks running, which may not be what you want, as it's only the HADAM3P EU models that are having upload problems: my PNW models have cleared without any problems. |
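If you would rather script this than click through the Manager, the same actions can, as far as I understand the boinccmd documentation, be done from the command line. The sketch below only builds and prints the command lines. The project URL is an assumption (use whatever URL your client shows for the project), the helper names are made up, and it is worth checking the options against `boinccmd --help` on your own install before running anything.

```python
import subprocess

# Assumption: the project URL as registered in the local BOINC client.
PROJECT_URL = "http://climateprediction.net/"

def no_new_tasks_cmd(url=PROJECT_URL):
    """Command to set 'No new tasks' for the project."""
    return ["boinccmd", "--project", url, "nomorework"]

def suspend_project_cmd(url=PROJECT_URL):
    """Command to suspend every task belonging to the project."""
    return ["boinccmd", "--project", url, "suspend"]

def suspend_task_cmd(task_name, url=PROJECT_URL):
    """Command to suspend one named task and leave the others running."""
    return ["boinccmd", "--task", url, task_name, "suspend"]

def network_off_cmd():
    """Command to suspend network activity for the whole client."""
    return ["boinccmd", "--set_network_mode", "never"]

def run(cmd):
    """Execute a command; raises CalledProcessError if boinccmd fails."""
    return subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Print the commands rather than running them, so nothing changes by accident.
    for cmd in (
        no_new_tasks_cmd(),
        suspend_task_cmd("hadam3p_eu_w4nd_1985_1_007212256_2"),
        network_off_cmd(),
    ):
        print(" ".join(cmd))
```

Suspending a single task is the closest match to the advice above, since it leaves the PNW and SAF models running while only the EU model stops producing Zips.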
|
|
|
|
|
I cannot suspend network activity, I have 6 other BOINC projects. I've set NNT.
Tullio
____________
|
|
|
|
|
|
Thanks Iain and thanks Lockley for responding to my post.
but just a few minutes ago:
it looks like things are back up and running ?
26/09/2012 5:54:28 AM | climateprediction.net | Sending scheduler request: To send trickle-up message.
26/09/2012 5:54:28 AM | climateprediction.net | Not reporting or requesting tasks
26/09/2012 5:54:31 AM | climateprediction.net | Scheduler request completed
26/09/2012 5:54:34 AM | climateprediction.net | Started upload of hadam3p_eu_wkar_1963_1_007216497_1_11.zip
26/09/2012 5:56:04 AM | climateprediction.net | Finished upload of hadam3p_eu_wkar_1963_1_007216497_1_11.zip
26/09/2012 5:56:05 AM | climateprediction.net | Started upload of hadam3p_eu_wi36_1978_1_007215635_2_9.zip
26/09/2012 5:56:05 AM | climateprediction.net | Started upload of hadam3p_eu_wkar_1963_1_007216497_1_9.zip
26/09/2012 5:57:39 AM | climateprediction.net | Finished upload of hadam3p_eu_wi36_1978_1_007215635_2_9.zip
26/09/2012 5:57:39 AM | climateprediction.net | Finished upload of hadam3p_eu_wkar_1963_1_007216497_1_9.zip
26/09/2012 5:57:39 AM | climateprediction.net | Started upload of hadam3p_eu_wjrj_1991_1_007208963_2_9.zip
26/09/2012 5:59:06 AM | climateprediction.net | Finished upload of hadam3p_eu_wjrj_1991_1_007208963_2_9.zip
26/09/2012 6:00:10 AM | climateprediction.net | Started upload of hadam3p_eu_wi36_1978_1_007215635_2_11.zip
26/09/2012 6:01:43 AM | climateprediction.net | Finished upload of hadam3p_eu_wi36_1978_1_007215635_2_11.zip
26/09/2012 6:05:44 AM | climateprediction.net | Started upload of hadam3p_eu_wi36_1978_1_007215635_2_10.zip
26/09/2012 6:07:16 AM | climateprediction.net | Finished upload of hadam3p_eu_wi36_1978_1_007215635_2_10.zip
26/09/2012 6:29:14 AM | climateprediction.net | Started upload of hadam3p_eu_wjrj_1991_1_007208963_2_11.zip
26/09/2012 6:30:46 AM | climateprediction.net | Finished upload of hadam3p_eu_wjrj_1991_1_007208963_2_11.zip
26/09/2012 6:55:10 AM | climateprediction.net | Sending scheduler request: To send trickle-up message.
26/09/2012 6:55:10 AM | climateprediction.net | Not reporting or requesting tasks
26/09/2012 6:55:13 AM | climateprediction.net | Scheduler request completed
26/09/2012 6:56:44 AM | climateprediction.net | Started upload of hadam3p_eu_wkar_1963_1_007216497_1_10.zip
26/09/2012 6:58:14 AM | climateprediction.net | Finished upload of hadam3p_eu_wkar_1963_1_007216497_1_10.zip
26/09/2012 7:43:59 AM | climateprediction.net | Started upload of hadam3p_saf_0xoa_1969_1_006876818_2_12.zip
26/09/2012 7:44:22 AM | climateprediction.net | Finished upload of hadam3p_saf_0xoa_1969_1_006876818_2_12.zip
26/09/2012 7:53:30 AM | climateprediction.net | Started upload of hadam3p_saf_0xoa_1969_1_006876818_2_13.zip
26/09/2012 7:53:33 AM | climateprediction.net | Computation for task hadam3p_saf_0xoa_1969_1_006876818_2 finished
26/09/2012 7:53:33 AM | climateprediction.net | Starting task hadam3p_eu_w4nd_1985_1_007212256_2 using hadam3p_eu version 609 in slot 1
26/09/2012 7:55:50 AM | climateprediction.net | Sending scheduler request: To send trickle-up message.
26/09/2012 7:55:50 AM | climateprediction.net | Not reporting or requesting tasks
26/09/2012 7:55:56 AM | climateprediction.net | Scheduler request completed
26/09/2012 7:57:05 AM | climateprediction.net | Finished upload of hadam3p_saf_0xoa_1969_1_006876818_2_13.zip
26/09/2012 8:00:05 AM | climateprediction.net | Started upload of hadam3p_eu_wjrj_1991_1_007208963_2_10.zip
26/09/2012 8:01:34 AM | climateprediction.net | Finished upload of hadam3p_eu_wjrj_1991_1_007208963_2_10.zip
26/09/2012 8:03:44 AM | climateprediction.net | Started upload of hadam3p_saf_0z6f_1998_1_006888367_2_12.zip
26/09/2012 8:04:07 AM | climateprediction.net | Finished upload of hadam3p_saf_0z6f_1998_1_006888367_2_12.zip
26/09/2012 8:13:08 AM | climateprediction.net | Started upload of hadam3p_saf_0z6f_1998_1_006888367_2_13.zip
26/09/2012 8:13:12 AM | climateprediction.net | Computation for task hadam3p_saf_0z6f_1998_1_006888367_2 finished
26/09/2012 8:13:12 AM | climateprediction.net | Starting task hadam3p_pnw_z862_1985_1_006941106_2 using hadam3p_pnw version 609 in slot 3
26/09/2012 8:17:01 AM | climateprediction.net | Finished upload of hadam3p_saf_0z6f_1998_1_006888367_2_13.zip
26/09/2012 8:34:43 AM | climateprediction.net | Started upload of hadam3p_saf_13xn_1970_1_006904131_1_12.zip
26/09/2012 8:35:06 AM | climateprediction.net | Finished upload of hadam3p_saf_13xn_1970_1_006904131_1_12.zip
26/09/2012 8:44:08 AM | climateprediction.net | Started upload of hadam3p_saf_13xn_1970_1_006904131_1_13.zip
26/09/2012 8:44:12 AM | climateprediction.net | Computation for task hadam3p_saf_13xn_1970_1_006904131_1 finished
26/09/2012 8:44:12 AM | climateprediction.net | Starting task hadam3p_saf_110z_1994_1_006890763_1 using hadam3p_saf version 609 in slot 2
26/09/2012 8:44:39 AM | climateprediction.net | update requested by user
26/09/2012 8:44:44 AM | climateprediction.net | Sending scheduler request: Requested by user.
26/09/2012 8:44:44 AM | climateprediction.net | Reporting 2 completed tasks, requesting new tasks for CPU and NVIDIA, sending trickle-up message
26/09/2012 8:44:46 AM | climateprediction.net | Scheduler request completed: got 0 new tasks
26/09/2012 8:44:46 AM | climateprediction.net | Project has no tasks available
26/09/2012 8:47:57 AM | climateprediction.net | Finished upload of hadam3p_saf_13xn_1970_1_006904131_1_13.zip
26/09/2012 8:52:48 AM | climateprediction.net | update requested by user
26/09/2012 8:52:49 AM | climateprediction.net | Sending scheduler request: Requested by user.
26/09/2012 8:52:49 AM | climateprediction.net | Reporting 1 completed tasks, requesting new tasks for CPU and NVIDIA
26/09/2012 8:52:51 AM | climateprediction.net | Scheduler request completed: got 0 new tasks
26/09/2012 8:52:51 AM | climateprediction.net | Project has no tasks available
|
|
|
|
|
|
I think, Byron, that you have filled up the server again with that lot.
Wed 26 Sep 2012 17:25:57 BST | climateprediction.net | [error] Error reported by file upload server: Server is out of disk space
Wed 26 Sep 2012 17:25:57 BST | climateprediction.net | Temporarily failed upload of hadam3p_eu_2qf2_1971_1_008173014_1_12.zip: transient upload error
Wed 26 Sep 2012 17:25:57 BST | climateprediction.net | Backing off 5 hr 43 min 9 sec on upload of hadam3p_eu_2qf2_1971_1_008173014_1_12.zip
Wed 26 Sep 2012 17:26:05 BST | climateprediction.net | Started upload of hadam3p_eu_2kj0_1962_1_008189170_0_3.zip
Wed 26 Sep 2012 17:26:06 BST | climateprediction.net | [error] Error reported by file upload server: Server is out of disk space
Dave |
|
|
|
|
|
I am still getting these messages. Any word on when it might be resolved?
____________
|
|
|
|
|
I am still getting these messages. Any word on when it might be resolved?
There is a problem with the server to which the data would normally be moved. No doubt when that problem is fixed the moving process will resume. |
|
|
|
|
|
I notice that some models are able to upload. My hadcm3n's seem to be uploading fine. From the 'other' board I think I read that pnw's also upload because they're going directly to the server at the Univ of WA where the project is located.
EU models, on the other hand, are completely backed up. I have 16 such files currently in the queue. However, they're only 13 MB apiece and I have plenty of disk space, so I'm going to let those models continue to run.
CPDN seems clearly to be a 'set it and forget it' project. Where the contradiction comes in is that, on average, the people participating in the project are technical and it's natural that many of them would want to know more of what's going on. Of course, we do know that CPDN is chronically short-handed.
Even though I've tried to keep these remarks 'neutral', I expect someone will find something to take issue with. Such is human nature. |
|
|
|
|
|
I notice that cpdnupload2.oerc is red now. That may mean it has been taken offline while the data is transferred; however, that in itself will take a while, as it is several TB.
Note that I am not suggesting they buy some, as what little I know about how things are set up at Oxford comes from my reading here, but I saw on Tom's Hardware the other day that someone is now selling a reasonably speedy 4TB drive.
While it might not mean much at Oxford, it would certainly solve all my space problems for a while if it were a bit cheaper. |
|
|
|
|
|
Just reporting some good news. Zip files seem to be uploading. |
|
|
|
|
|
For my units, the PNW units are uploading fine; the EU units have just been sitting here. One system is working on its last model; I hope there is new work this week.
____________
Keep on crunching Pizza@Home |
|
|
|
|
|
PNW goes directly to Uni of Oregon, USA, so it doesn't count.
New work won't even be considered until ALL of the server problems are sorted, which may be another week yet.
Michaelmas Term starts in a week's time, or thereabouts, so Long Vacation will finish in a few days, and all of the IT people who scarpered as soon as it started should be back soon, dealing with problems in their various areas.
____________
Backups: Here |
|
|
|
|
|
Just when I was about to write "it's stuck with me too" it started to upload again :)
edit : well then it gets stuck again, then it restarts again... so I guess we'll have to wait for the return of the Jedi... |
|
|
|
|
|
I don’t believe that there is much that you can do to speed this up. The only real solution is to wait for the server problems to be fixed. You might suspend network activity so that the stuck zip file doesn’t keep trying to upload. If you are running other types of WUs or other BOINC projects, you can re-enable network activity about once a day to let those uploads through and then suspend it again.
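For anyone who would rather script the suspend and re-enable routine than click through the Manager, here is a minimal Python sketch. It assumes `boinccmd` is on the PATH and can talk to the local client; the helper names are made up for illustration. As far as I know, `--set_network_mode` takes `always`, `auto` or `never` plus an optional duration in seconds, after which the client reverts to its permanent mode, but check that against your own BOINC version.

```python
import subprocess

VALID_MODES = ("always", "auto", "never")


def network_mode_command(mode, duration_seconds=None):
    """Build the boinccmd argument list for a network mode change.

    Hypothetical helper: it only assembles the command and runs nothing.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown network mode: {mode}")
    cmd = ["boinccmd", "--set_network_mode", mode]
    if duration_seconds is not None:
        # A duration makes the change temporary (assumption: seconds).
        cmd.append(str(int(duration_seconds)))
    return cmd


def set_network_mode(mode, duration_seconds=None):
    """Run boinccmd; assumes it is installed and the client is running."""
    subprocess.run(network_mode_command(mode, duration_seconds), check=True)
```

Calling `set_network_mode("never")` once would hold back the stuck uploads, and a daily scheduled `set_network_mode("auto", 3600)` would open the network for an hour so other projects can report.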
____________
|
|
|
|
|
|
To: J. Patrick Malone
I've hidden your post to stop spammers from getting your email address.
As for mailing results back to the project, this isn't how BOINC projects work.
You'll just have to wait patiently like all of us.
If you read back through this thread, you'll find one of my earlier posts, where I listed the only steps that can be taken.
____________
Backups: Here |
|
|
|
|
|
FWIW, my long queue of EU model uploads has decreased and some of the uploads are now getting through. |
|
|
|
|
|
Here we go again :(
03/10/2012 12:01:08 | climateprediction.net | [error] Error reported by file upload server: can't write file /storage/incoming/uploader//hadam3p_eu_2r82_1972_1_008189180_0_7.zip: No space left on server
03/10/2012 12:01:08 | climateprediction.net | Temporarily failed upload of hadam3p_eu_2r82_1972_1_008189180_0_7.zip: transient upload error
03/10/2012 12:01:08 | climateprediction.net | Backing off 9 hr 1 min 29 sec on upload of hadam3p_eu_2r82_1972_1_008189180_0_7.zip
|
|
|
|
|
|
It's more a matter of "still" rather than "again".
Have you read the News thread?
It could be next week before the bulk of the uploads get through.
____________
Backups: Here |
|
|
|
|
|
But all my uploads, and those of others, went through, so I thought the problem(s) were fixed; that's why I wrote "again". Never mind.
And, of course, I've read both the News and Announcements plus the other threads (Uploads not working, Server out of disk space,...) not to mention my own topic "Permanent HTTP Error".
|
|
|
|
|
|
If everyone were to pick a day of the week and a time to enable internet activity, it would reduce the load on the server after outages. Even if quite a few people chose the same day, it would reduce the hammering when the server first comes back online. Perhaps the information where people sign up should suggest this? I know it is nice to look at stats and see how you are doing, but I am sure most of us could cope with getting our fix once a week rather than several times a day? |
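One way the pick-a-day idea above could work without any coordination is for each machine to derive its own slot from something stable, such as its host name. The following is only a sketch of that idea, not anything BOINC or CPDN actually does; `upload_slot` and the hourly granularity are assumptions made for illustration.

```python
import hashlib

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def upload_slot(host_name):
    """Map a host name to a stable (weekday, hour) upload slot.

    Hashing spreads hosts roughly evenly over the 7 * 24 hourly slots,
    and the same host always lands in the same slot.
    """
    digest = hashlib.sha256(host_name.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:4], "big") % (7 * 24)
    return DAYS[slot // 24], slot % 24


day, hour = upload_slot("my-cruncher")
print(f"Enable network activity on {day} at {hour:02d}:00")
```

Because the slot comes from a hash rather than a personal choice, volunteers would not all end up picking Sunday evening.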
|
|
|
|
|
Possibly, but the point is that this issue has been present for some time. I currently have 20 eu zip files unable to upload.
Hopefully we will be told when this has been resolved, on the news thread - although they did say "next week" about a week ago...
____________
Brian |
|
|
|
|
|
Just a few minutes ago, while I was reading this thread, my 24 EU zip files started to upload. Yay!
|
|
|
|
|
|
Just reporting some good news. All 24 of my EU zip files have now uploaded at 175 kbps. Well done to the team @ Oxford, and thanks to Jonathan Miller, CPDN SysAdmin!
|
|
|
|
|
|
I managed to get my remaining 16 EUs to upload 'overnight'.
So it's not fixed yet, just "getting there".
Data is still being moved off a couple of servers to storage, but more is coming in just as fast.
I've been watching this in the messages on one of my computers, as 16 files slowly uploaded.
According to the Status page yesterday, there were over 135,000 tasks running, and now it says 127,477, so it's coming down.
Just thinking out loud, if only a quarter of those "running" were due to pending uploads, and each one only had a quarter of their files waiting, that's about 90,000 zips fighting each other for disk space.
It must be somewhat like a person running an ultra-marathon through vast swarms of stampeding elephants, rhinos and wildebeests, while juggling a dozen sharp knives.
There have been more hardware and software failures since the weekend, but Jonathan and Andy have their eyes on things.
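The back-of-envelope figure above can be written out as a short calculation. This is only a sketch: the 12 zips per task is an assumption (EU models upload intermediate files _1 to _12, and some model types have a 13th), and the two quarter fractions are the guesses from the paragraph above, not measured values.

```python
def pending_zip_estimate(tasks_in_progress, share_stalled, zips_per_task, share_waiting):
    """Rough count of zip files queued for upload across all hosts."""
    return tasks_in_progress * share_stalled * zips_per_task * share_waiting


# 127,477 tasks from the Status page; a quarter stalled on uploads;
# 12 zips per task (assumed); a quarter of each task's zips waiting.
estimate = pending_zip_estimate(127_477, 0.25, 12, 0.25)
print(round(estimate))
```

With these inputs the result is a little over 95,000, the same order of magnitude as the "about 90,000" guess.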
____________
Backups: Here |
|
|
|
|
|
Les, thank you for your post. Like you say, we're not out of the woods yet;
there could be more bumps in the road ahead.
best wishes to the team @ Oxford!
and thank you again to Andy and Jonathan, CPDN SysAdmins for a job well done!
Byron |
|
|
|
|
|
Yes, thank you Les and others who work tirelessly to keep everything up and running.
I just wanted to add an 'FYI' that the uploader server issues again seem to be interfering with attempts to download work from the 'reference site' onto a Windows machine which I've just set up for CPDN number crunching. Searching the forum archives, I found that my symptoms are the same as those experienced and explained in message 44708 (http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7442&nowrap=true#44708). Per the advice given there, I'll just remain patient while everything is returned to normal.
Thanks again,
-- Jim |
|
|
|
|
According to the Status page yesterday, there were over 135,000 tasks running,
My main machine has two tasks running but four more in the queue. As these are listed as being in progress when I look at the computer's page, I presume that those 135,000 include those queued on machines but not yet started? A trivial point, I know, given the problems with hardware and software etc., but it piqued my curiosity and I wondered how many tasks are actually "in progress". My other Linux machine doesn't have any in the queue at the moment, so my own average would be half of those listed. I will leave it to someone else with more machines to work out something with more statistical validity! |
|
|
|
|
According to the Status page yesterday, there were over 135,000 tasks running,
My main machine has two tasks running but four more in the queue. As these are listed as being in progress when I look at the computer's page, I presume that those 135,000 include those queued on machines but not yet started? A trivial point, I know, given the problems with hardware and software etc., but it piqued my curiosity and I wondered how many tasks are actually "in progress". My other Linux machine doesn't have any in the queue at the moment, so my own average would be half of those listed. I will leave it to someone else with more machines to work out something with more statistical validity!
The 135,000 number is inaccurate for at least two reasons. First, as you noted, some indeterminate number of those are waiting "Ready to start" on somebody's host(s). Another indeterminate number have downloaded to hosts that will never finish the task(s).
I have 6 machines running 24/7; 3 of them are somewhat fast. There's very little in the queue "Ready to start" but a whole lot pending upload. I am only letting each machine go online once a day to update stats, trickle up, and download my preferred WUs from other projects. I let only one of them at a time stay online for a full day, until its upload queue clears; after that I leave it online. It will take a few days to finish the 80+ uploads (per fast host) that are still pending. My slower hosts are all caught up and online for new work. I'm running on a slowish DSL line and expect uploads to catch up within a couple of days. I'm getting downloads from time to time (probably old WUs that timed out and got resubmitted automatically).
Figuring an overall reduction factor to adjust the supposed 130,000 tasks "out there" would be really difficult. I have prefs set to start downloading when any task is within 28 hours of completing, so the ratio of "Ready to start" to "Running" is less than 25% here. My 3 faster hosts still have more tasks "uploading" than they have "running". It will be a while, and I'm not going to push my uploads, because there are lots of other people with worse network connections than mine and I'm not going to do anything to overload the fragile servers.
____________
|
|
|
|
|
|
OK, my mistake. The label is actually "Tasks in progress".
This is an abbreviation for: I've sent this number of work units to client computers, and they aren't yet on my work list as being completed or failed. Therefore they're still out there somewhere.
As for new work that's occasionally being received, that's due to the resubmission script being fired up to slowly produce new data sets that continue the sequence of past work that has been returned intact.
____________
Backups: Here |
|
|