Message boards : Number crunching : The uploads are stuck
Joined: 3 Sep 04 · Posts: 105 · Credit: 5,646,090 · RAC: 102,785
Good job it's not the weekend... At least I got 2 complete WUs uploaded.
Joined: 5 Aug 04 · Posts: 1120 · Credit: 17,202,087 · RAC: 2,202
climateprediction.net 12-01-2023 01:22 [error] Error reported by file upload server: Server is out of disk space
I got my 40 tasks all uploaded before this happened. I then started work on 5 new tasks, and two or three of those 14 MB files accumulated on my machine, but they went up in a bunch. Now I have two screens full to send up, with a 35-minute backoff. Glad my disk space is very large and was just cleaned up.
Joined: 27 Mar 21 · Posts: 79 · Credit: 78,302,757 · RAC: 1,077
Glenn Carver wrote:
xii5ku wrote: It's not for a lack of the client's trying. First, I still have 1 running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred. FWIW, when I came home from work 3 hours ago it worked somewhat. It quickly worsened, and now _all_ transfer attempts fail. I've got the exact same situation now as the one @Stony666 described.
There may be some boinc-ness things going on. I vaguely remember Richard saying something about uploads stopping processing if it tries and fails 3 times? Or something like that? Uploads are still OK for me; I've got another 1000 to do.
Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role in why I have been less successful than others, maybe not.
Joined: 29 Oct 17 · Posts: 1046 · Credit: 16,316,506 · RAC: 16,122
xii5ku wrote: It's not for a lack of the client's trying. First, I still have 1 running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred.
Do you have these issues with other projects? And, when CPDN isn't playing catch-up with their uploads, do you have the problem then? Just trying to understand whether this is a general problem you have or whether it's just related to what's going on with CPDN at the moment.
Joined: 15 May 09 · Posts: 4532 · Credit: 18,835,737 · RAC: 21,348
xii5ku wrote: Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role in why I have been less successful than others, maybe not.
8 Mbit/s is 80 times faster than what I have. I don't have problems when the servers are working properly. No issues with recent batches of Hadley models, but problems yesterday evening even when uploads were going through, due to the congestion. Now we are waiting for data to be moved off the server, which should have happened automatically to prevent it filling up. Those above my pay grade (£0/hour) are investigating this.
Joined: 29 Oct 17 · Posts: 1046 · Credit: 16,316,506 · RAC: 16,122
Update on the upload server, 11:15 GMT
Had email from CPDN that they are moving data off the upload server; it will be some time before they can enable httpd again. Wasn't given a time estimate, but they have to move 25 TB, and last downtime it took them the best part of a day to move the data from the broken upload server.
Joined: 1 Jan 07 · Posts: 1059 · Credit: 36,657,707 · RAC: 14,406
Surely, for the long-term future (and with a weekend coming up), they have to configure a solution where the forwards pipe (upload server --> backing storage) runs faster than the inwards pipe (users --> upload server)? Even if that involves throttling the inwards pipe ...
Joined: 6 Aug 04 · Posts: 195 · Credit: 28,254,591 · RAC: 10,553
Glenn Carver wrote: Update on the upload server, 11:15 GMT
Thank you for the update, Glenn.
Joined: 29 Oct 17 · Posts: 1046 · Credit: 16,316,506 · RAC: 16,122
Surely, for the long-term future (and with a weekend coming up), they have to configure a solution where the forwards pipe (upload server --> backing storage) runs faster than the inwards pipe (users --> upload server)?
That is how it works when it's functioning normally; the transfer server runs to keep the upload server under quota.
Joined: 1 Jan 07 · Posts: 1059 · Credit: 36,657,707 · RAC: 14,406
That is how it works when it's functioning normally; the transfer server runs to keep the upload server under quota.
Judging by the timestamps in this thread, the upload server was open to users between about 10:30 and 00:30 yesterday - around 14 hours. We don't know how much was transferred to backing store in that time, but the excess of incoming over outgoing was enough to fill the intermediate storage. If it's going to take 24 hours to transfer that excess, then the two rates - in practice, before final tuning - are seriously out of alignment.
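A rough back-of-envelope check, using only the figures quoted in this thread (~25 TB stuck on the server, ~14 hours of open uploads, roughly a day to drain it again), gives a sense of that mismatch. These are approximations, not measurements, and the "fill" figure is really the net accumulation rate rather than the raw user upload rate:

```python
# Back-of-envelope rates from the figures quoted in this thread (approximate).
TB = 1e12  # bytes

accumulated = 25 * TB        # data reportedly stuck on the upload server
open_window = 14 * 3600      # seconds the upload server was accepting files
drain_time = 24 * 3600       # rough time quoted to move the backlog off again

net_fill_rate = accumulated / open_window   # ~496 MB/s net into the server
drain_rate = accumulated / drain_time       # ~289 MB/s out to backing storage

print(f"net fill ≈ {net_fill_rate / 1e6:.0f} MB/s, drain ≈ {drain_rate / 1e6:.0f} MB/s")
```

On those rough numbers the drain would need to run getting on for twice as fast, or the intake would need throttling, for the intermediate store to stay under quota during a catch-up surge.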
Joined: 15 May 09 · Posts: 4532 · Credit: 18,835,737 · RAC: 21,348
I am guessing the transfer server could probably cope with normal operation, just not with the number of backed-up computers throwing zips at the upload server.
Joined: 29 Oct 17 · Posts: 1046 · Credit: 16,316,506 · RAC: 16,122
That is how it works when it's functioning normally; the transfer server runs to keep the upload server under quota.
Judging by the timestamps in this thread, the upload server was open to users between about 10:30 and 00:30 yesterday - around 14 hours. We don't know how much was transferred to backing store in that time, but the excess of incoming over outgoing was enough to fill the intermediate storage. If it's going to take 24 hours to transfer that excess, then the two rates - in practice, before final tuning - are seriously out of alignment.
There's been a period of disruption because of the difficulty with filesystems that's led to TBs of files being where they shouldn't. I don't think there are any serious issues, from the bits I know. I'm told that all project (inc. Had* models) data transfers are complete except the OpenIFS ones, which have yet to get to the point where the upload can be opened up again.
Joined: 27 Mar 21 · Posts: 79 · Credit: 78,302,757 · RAC: 1,077
Glenn Carver wrote:
xii5ku wrote: It's not for a lack of the client's trying. First, I still have 1 running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred.
Do you have these issues with other projects? And, when CPDN isn't playing catch-up with their uploads, do you have the problem then? Just trying to understand whether this is a general problem you have or whether it's just related to what's going on with CPDN at the moment.
It's specific to CPDN. IIRC I had basically flawless transfers early after the start of the current OpenIFS campaign, but soon this changed to a state in which a certain number of transfers got stuck at random transfer percentages. I worked around this by increasing max_file_xfers_per_project a little. This period ended at the server's Christmas outage. When I came home from work yesterday (evening in the UTC+0100 zone), I found one of the two active computers transferring flawlessly, the other one having frequent but not too many transfers getting stuck at random percentages. This worsened within a matter of hours on both computers, to the point that all transfer attempts got stuck at 0%. (Another few hours later, the server ran out of disk space, which changed the client log messages accordingly.)
I just came home from work again and am seeing "connect() failed" messages now. Gotta read up in the message board on the current server status.
As for other projects: I don't have projects with large uploads active at the moment. TN-Grid and Asteroids with infrequent small transfers, and Universe with frequent smallish transfers, are working fine right now on the same two computers, in the same BOINC client instances as CPDN. Earlier in December, I took part in the PrimeGrid challenge, which had one large result file for each task, and those transferred flawlessly too.
So to summarize: it's a problem specifically with CPDN's upload server, and this problem existed with lower severity before the holidays outage.
________
Edit: So… is the new "connect() failed" failure mode expected, given the current need to move data off the upload server? Since last night, I have been logging the number of files to transfer on both of my active computers at 30-minute intervals, and these numbers have been monotonically increasing. (Each computer has one OIFS task running.) I haven't checked whether the growth of the backlog exactly matches the file creation rate of the running tasks, but I guess it does. The client logs of the past 2 or 3 hours don't show a single success.
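For reference, the workaround mentioned above (max_file_xfers_per_project) is a client-side setting in BOINC's cc_config.xml. A minimal sketch with purely illustrative values follows; the stock per-project default is 2, and raising it cannot help while the server itself refuses connections:

```xml
<cc_config>
  <options>
    <!-- Allow a few more simultaneous file transfers per project (default 2).
         The value 4 is only an example, not a recommendation. -->
    <max_file_xfers_per_project>4</max_file_xfers_per_project>
    <!-- Overall cap across all projects (default 8). -->
    <max_file_xfers>8</max_file_xfers>
  </options>
</cc_config>
```

The client re-reads this file after a restart, or on demand with `boinccmd --read_cc_config`.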
Joined: 1 Jan 07 · Posts: 1059 · Credit: 36,657,707 · RAC: 14,406
Edit: So… is the new "connect() failed" failure mode expected, given the current need to move data off the upload server?
I'd say that the answer is embedded in message 67605, but you have to decode it.
Had email from CPDN that they are moving data off the upload server; it will be some time before they can enable httpd again. Wasn't given a time estimate, but they have to move 25 TB, and last downtime it took them the best part of a day to move the data from the broken upload server.
The upload server filled up overnight. CPDN staff are moving files to another place, but it can't handle new files while the old ones are in the way. So staff have disabled our ability to upload files for the time being - they've "disabled httpd". The disabling of httpd has the effect of blocking our attempts to connect to the server when we want to upload files. Hence the message "connect() failed". It's similar to the message you sometimes see, "Project is down for maintenance" - a planned stoppage rather than an unplanned one (this time, at least).
Joined: 29 Oct 17 · Posts: 1046 · Credit: 16,316,506 · RAC: 16,122
xii5ku wrote: It's specific to CPDN. IIRC I had basically flawless transfers early after the start of the current OpenIFS campaign, but soon this changed to a state in which a certain number of transfers got stuck at random transfer percentages. I worked around this by increasing max_file_xfers_per_project a little. This period ended at the server's Christmas outage. When I came home from work yesterday (evening in the UTC+0100 zone), I found one of the two active computers transferring flawlessly, the other one having frequent but not too many transfers getting stuck at random percentages.
Maybe the first is occupying the available bandwidth/slot. I see similar with my faster broadband. Getting stuck is perhaps to be expected given your remote location?
xii5ku wrote: As for other projects: I don't have projects with large uploads active at the moment. TN-Grid and Asteroids with infrequent small transfers, and Universe with frequent smallish transfers, are working fine right now on the same two computers, in the same BOINC client instances as CPDN. Earlier in December, I took part in the PrimeGrid challenge, which had one large result file for each task, and those transferred flawlessly too.
Depends what you mean by 'large' and 'small' in relation to transfer size. We opted for more, smaller files to upload rather than fewer, larger files, precisely for people further afield. Each upload file is ~15 MB for this project. I still think that's the right approach, even though I recognise people got a bit alarmed by the number of upload files that was building up. Despite the problem with 'too many uploads' blocking new tasks, I am reluctant to up the maximum upload size limit, especially if you are having problems now.
xii5ku wrote: So to summarize: it's a problem specifically with CPDN's upload server, and this problem existed with lower severity before the holidays outage.
I think that's part of your answer there. When there isn't a backlog of transfers clogging the upload server, it can tick along fine.
xii5ku wrote: Since last night, I have been logging the number of files to transfer on both of my active computers at 30-minute intervals, and these numbers have been monotonically increasing. (Each computer has one OIFS task running.) I haven't checked whether the growth of the backlog exactly matches the file creation rate of the running tasks, but I guess it does. The client logs of the past 2 or 3 hours don't show a single success.
It probably is, as the upload server is not allowing any connections at present. If it really troubles people to see so many uploads building up, we can modify it for these longer model runs (3-month forecasts).
Joined: 5 Aug 04 · Posts: 178 · Credit: 18,468,287 · RAC: 64,501
If it really troubles people to see so many uploads building up, we can modify it for these longer model runs (3-month forecasts).
Hm, 121 trickle files for a job that lasts between 12 and 14 hours is way too much in my view. Does each trickle contain really useful data, or only the sign that the WU is still alive?
Supporting BOINC, a great concept!
Joined: 6 Aug 04 · Posts: 195 · Credit: 28,254,591 · RAC: 10,553
If it really troubles people to see so many uploads building up, we can modify it for these longer model runs (3-month forecasts).
I'm OK with the current mix of task run time, file size, file numbers and our upload broadband speed; even with the outage, it's manageable for me. If needed, I can increase the VM disk for Ubuntu to over 900 GB.
Joined: 12 Apr 21 · Posts: 315 · Credit: 14,658,552 · RAC: 17,872
Surely, for the long-term future (and with a weekend coming up), they have to configure a solution where the forwards pipe (upload server --> backing storage) runs faster than the inwards pipe (users --> upload server)?
I'd have also thought that from the time the uploads started flowing again, there would be hawk-eyes on that server until everything is back to normal. It kind of seems like we've been reacting to problems rather than proactively trying to prevent them.
Joined: 14 Sep 08 · Posts: 127 · Credit: 41,180,011 · RAC: 54,807
I'd have also thought that from the time the uploads started flowing again, there would be hawk-eyes on that server until everything is back to normal. It kind of seems like we've been reacting to problems rather than proactively trying to prevent them.
This is the most worrying part, honestly. Disk filling up should be easily predictable based on ingress and egress rates. It's quite clear by now that no one is monitoring the system, or has alerting, or is trying to get ahead of trouble. While I know this is not a highly available service, that doesn't mean more care isn't needed when it's having real trouble. If no one is bothering to watch more carefully after almost three weeks of downtime and multiple failed recoveries, it starts to feel like keeping it up is not a priority at all.
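To illustrate how cheap that kind of prediction is, here is a minimal sketch of a time-to-full check; it is not anything CPDN actually runs, the mount point and thresholds are made-up placeholders, and a real deployment would sample continuously and raise an alert rather than print:

```python
import shutil
import time

UPLOAD_VOLUME = "/"      # placeholder: mount point of the upload partition
WARN_HOURS = 24          # warn if the volume is projected to fill within this window
SAMPLE_SECONDS = 600     # gap between the two free-space samples

def free_bytes(path: str) -> int:
    """Free space on the filesystem containing `path`."""
    return shutil.disk_usage(path).free

first = free_bytes(UPLOAD_VOLUME)
time.sleep(SAMPLE_SECONDS)
second = free_bytes(UPLOAD_VOLUME)

# Net rate at which the volume is filling (incoming minus drained); >0 means filling.
net_fill_rate = (first - second) / SAMPLE_SECONDS
if net_fill_rate > 0:
    hours_to_full = second / net_fill_rate / 3600
    if hours_to_full < WARN_HOURS:
        print(f"WARNING: upload volume projected to fill in {hours_to_full:.1f} h")
else:
    print("Volume is draining or steady; no action needed.")
```

Two samples of free space are enough to extrapolate; run it from cron every few minutes and wire the warning to email or a pager.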
Joined: 14 Sep 08 · Posts: 127 · Credit: 41,180,011 · RAC: 54,807
Does each trickle contain really useful data, or only the sign that the WU is still alive?
I am curious about this too, if anyone cares to educate a bit. From reading other posts, my wild guess is that all trickles contain real results. It might just be a means of breaking up the otherwise huge final result and spreading the upload across the lifetime of a task. Otherwise, a whole lot of partial 2 GB files could be really problematic for the server to hold onto. I could be totally wrong though.