Message boards : Number crunching : The uploads are stuck
Joined: 5 Aug 04 Posts: 1118 Credit: 17,163,134 RAC: 2,081

Does each trickle contain really useful data, or only a sign that the "WU is still alive"? I cannot imagine it takes a 14-megabyte message every few minutes to report that a task is still alive.
Joined: 7 Sep 16 Posts: 262 Credit: 34,672,453 RAC: 14,037

This is the most worrying part, honestly. Disk filling up should be easily predictable based on ingress and egress rates. It's quite clear by now that no one is monitoring the system, or has alerting, or is trying to get ahead of trouble. While I know this is not a highly available service, that doesn't mean more care isn't needed when it's having real trouble. If no one is bothering to watch more carefully after almost three weeks of downtime and multiple failed recoveries, it starts to feel like keeping it up is not a priority at all.

The confusion I have is that there's talk of contracts with certain deadlines that they're trying to meet, presumably with at least some resources behind getting that compute done. To then run an ingest server entirely out of disk, seemingly without warning, is a bit baffling. If it's cloud magic, I know it's (usually) not that hard to shut a box down, resize the virtual backing disk, boot it, growpart, and resize2fs - I do this all the time on my VMs when I'm low on space, and some of the cloud OS images in use actually have an init script that does this as needed - on boot, grow the partition and resize the filesystem. Going smaller is harder, but going bigger isn't.

Either the contract goals matter, at which point having almost all the compute on the project stalled for what's coming up on a month seems quite absurd (I haven't gone out of my way to work around the BOINC limits when I'm so far underwater on uploads anyway). Or the contract goals don't, at which point... m'kay. Whatever, but then there's no real deadline to speak of, so it doesn't matter.

I get that the issues seem to be on the cloud provider's side, but at some point, "Rack up a box with a bunch of disks, or just use a desktop to get things flowing!" makes sense, IMO. I'm fairly certain BOINC supports multiple upload servers - just a failover "Yes, take the WUs from the clients, we'll process them later!" box seems like it would be useful, though the amount of disk space required is clearly non-trivial.

I am curious about this too, if anyone cares to educate a bit. From reading other posts, my wild guess is that all trickles contain real results. It might just be a means of breaking up the otherwise huge final result and spreading the upload across the lifetime of a task. Otherwise, a whole lot of partial 2GB files could be really problematic for the server to hold onto. I could be totally wrong though.

My understanding of them, from a purely "compute volunteer" level, is that the trickle files contain the intermediate state of the model at various timepoints - so if the model crashes it can still return correct results. This made a lot of sense, IMO, on the "10-15 day" models that have been running (the N216s and such), but I'm not sure it's particularly useful with the 10-hour models - that's a pretty standard workunit compute period, so I can't imagine there are too many machines starting and not finishing them, with the tasks unable to be reassigned. I've crashed a few from the OOM killer (5GB per task seems to be about the requirement in terms of system RAM - running 4 tasks on 16GB will OOM often enough to be a problem), but they should just get reassigned to someone else.

Personally, if the total upload size is about the same, I'd prefer "more smaller files" over "one larger file," simply because when my ISP goes erratic, or the machines suspend at night for lack of usable solar power, it's less to resend or try to resume. Uploading 100x 20MB files is likely to be more reliable for me, with fewer retries, than uploading a single 2GB file. I expect these limits don't apply to other people, but neither are their fiber connections going to struggle with uploading a bunch of smaller files either.
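For reference, the online disk-grow workflow mentioned above looks roughly like this on a typical Linux cloud VM with an ext4 filesystem. This is only an illustrative sketch: the device name (/dev/sda), partition number and mount point are assumptions that will differ per provider and image.

# after enlarging the virtual disk in the provider's console or API:
sudo growpart /dev/sda 1     # grow partition 1 into the new space (from cloud-guest-utils)
sudo resize2fs /dev/sda1     # grow the ext4 filesystem online to fill the partition
df -h /                      # confirm the extra space is visible

Images that ship cloud-init can run its growpart/resizefs modules at boot, which is the "init script that does this as needed" behaviour described above.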
Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077

Glenn Carver wrote:
xii5ku wrote: One of the things which I tried on Wednesday was to bandwidth-limit the 'good' computer to less than half of my uplink's width and leave the 'bad' computer unlimited. This did not improve the 'bad' computer's situation though. It's specific to CPDN. IIRC I had basically flawless transfers early after the start of the current OpenIFS campaign, but soon this changed to a state in which a certain amount of transfers got stuck at a random transfer percentage. I worked around this by increasing max_file_xfers_per_project a little. This period ended at the server's X-Mas outage. — When I came home from work yesterday (evening in the UTC+0100 zone), I found one of the two active computers transferring flawlessly, the other one having frequent but not too many transfers getting stuck at a random percentage.
Maybe the first is occupying available bandwidth/slot. I see similar with my faster broadband. Getting stuck is perhaps to be expected given your remote location??

I recall two other Germans commenting here on the Wednesday situation: one being stuck exactly as myself, the other having transferred everything at very high speed. I know of one user at the US East coast, on a big commercial datacenter pipe, who had an increasing rate of transfers getting stuck on Wednesday too, hours before the server ran out of disk space.

Glenn Carver wrote:
xii5ku wrote: As for other projects: I don't have projects with large uploads active at the moment. TN-Grid and Asteroids with infrequent small transfers, and Universe with frequent smallish transfers, are working fine right now on the same two computers, in the same boinc client instances as CPDN. Earlier in December, I took part in the PrimeGrid challenge, which had one large result file for each task, and they transferred flawlessly too.
Depends what you mean by 'large' & 'small' in relation to transfer size.

– PrimeGrid: one 128 MB file and one tiny file per result; very long task duration, but I had multiple computers running, so there were occasions during which I saturated the upstream of my cable link for a few hours
– TN-Grid: one file with <10 KB per result
– Universe@home: six files per result, ranging from 20 B to <60 kB; current task duration 45 minutes, resulting in a quite high connection rate
– Asteroids@home: one file per result, 60…300 kB
Again, no problem between my clients and these project servers. [Back in May 2022, there was a competition at Universe@home (the BOINC Pentathlon's marathon), during which the foreseeable happened: its server had to drop most of the connection attempts of participants in this competition, as it could not handle the high combined request rate.] Also, over the handful of years during which I have been using BOINC now, there were multiple occasions during which I saturated my upload bandwidth with result uploads for longer periods (hours, days) at various projects, without the problems which I encountered with upload11.cpdn.org.

Glenn Carver wrote: We opted for more 'smaller' files to upload rather than fewer 'larger' files to upload, precisely for people further afield. Each upload file is ~15 MB for this project. I still think that's the right approach, even though I recognise people got a bit alarmed by the number of upload files that was building up. Despite the problem with 'too many uploads' blocking new tasks, I am reluctant to up the maximum upload size limit, esp. if you are having problems now.

It doesn't matter much for the failure mode which I described (and which was apparently experienced by some others too). If a transfer got stuck after having transferred some amount of data, the next successful retry (once there was one) would pick up the transfer where it left off, i.e. the previously transferred portion did not have to be retransmitted.

Furthermore, regarding the client's built-in blocking of new work requests: that's a function of the number of tasks (a.k.a. results) which are queued for upload, not of the number of files which are queued for upload. (It's also a function of the number of logical CPUs usable by BOINC, as you recall.)

Personally, I don't have a preference for how you split the result data into files. For my amount of contribution – during periods in which the upload server works at least somewhat – only the total data size per result matters. As for periods during which the rate of stuck transfers is very high but not 100%, I have no idea whether fewer or more files per result would help. And for completeness: users who are network-bandwidth constrained (such as myself), or who are disk-size constrained, cannot contribute at all for as long as the upload server is unavailable; then the file split doesn't matter at all anymore, obviously.

For most of the time of the current OpenIFS campaign during which the upload server wasn't down, the upload server was "ticking along fine" seemingly to one part of the contributors, and was all along showing signs of "just barely limping along" to another part of the contributors.

Glenn Carver wrote:
xii5ku wrote: So to summarize: it's a problem specifically with CPDN's upload server, and this problem existed with lower severity before the Holidays outage.
I think that's part of your answer there. When there isn't a backlog of transfers clogging the upload server, it can tick along fine.

The server's behaviour on Wednesday _might_ be an indication that some of the problems which led to the Great Holiday Outage have not actually been resolved. (Vulgo, somebody somewhere _seems_ to be flogging a dead horse.)

Glenn Carver wrote: If it really troubles people to see so many uploads building up in number, we can modify it for these longer model runs (3-month forecasts).

The precise file split isn't too critical. What matters is: how much result data do the scientists need, and in what time frame? Based on that, what rate of data transfers does the server infrastructure need to support? Take into account that many client hosts can only operate for as long as they can upload. — That's all pretty obvious to everybody reading this message board, but to me it's clear that somebody somewhere must have cut at least one corner too many last December.

________

PS: I'm not complaining, just observing. The scientist needs the data; I don't.
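For anyone wanting to try the same workaround (raising the per-project transfer limit), the knob lives in the client's cc_config.xml in the BOINC data directory. A minimal sketch, with the values shown purely as an example (the client defaults are 8 and 2 respectively):

<cc_config>
  <options>
    <max_file_xfers>8</max_file_xfers>
    <max_file_xfers_per_project>4</max_file_xfers_per_project>
  </options>
</cc_config>

The client picks the file up after Options -> Read config files, or on restart. Raising the limit only helps while the server actually has connection slots to spare.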
Joined: 26 Oct 11 Posts: 15 Credit: 3,275,889 RAC: 0

Hello All,

Brief update on status. The upload server is back running and we are currently in the process of transferring ~24TB of built-up project results from that system to the analysis datastores. This process is going to take ~5 days running 5 parallel streams (the files are all OpenIFS workunits). I have asked Andy to restart uploads but to throttle them, to ensure that our total stored volume does keep decreasing, i.e. that our upload rate doesn't exceed our transfer rate. As such we'll be slow for a while, but we will gradually increase the upload server bandwidth to you guys as we clear batches.

The issue was caused by an initial instability brought about because the system disks for the VMs that run the upload server and the data storage volumes are all actually hosted in the same physical data system. When the data volumes fill, they affect the performance of the other disks as well. This was exacerbated because they allowed us to create extremely large volumes that were really beyond the capability of the storage system, so we have to move the data internally as well. Not an ideal solution, and we've told JASMIN this.

Thank you for your understanding in what's been a difficult few days.

David
Joined: 9 Mar 22 Posts: 30 Credit: 1,065,239 RAC: 556

The upload server is back running ... I have asked Andy to restart uploads but to throttle ...

Throttling appears to be too strict:

nc -zvw 5 upload11.cpdn.org 80
nc: connect to upload11.cpdn.org port 80 (tcp) failed: No route to host

;-)
Joined: 1 Jan 07 Posts: 1051 Credit: 36,341,855 RAC: 2,973

Depends on how he's implemented the throttling. I would expect the easiest to be a 'lower maximum number of simultaneous connections': the lucky few latch on, the rest wait their turn.
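Purely for illustration - the project hasn't said what software fronts upload11.cpdn.org or how the cap is enforced - a 'maximum simultaneous connections' throttle of this kind on an Apache front end (BOINC upload handlers commonly sit behind Apache) could look something like the following, sized for the 50-connection figure David quotes below:

# event MPM: at most ServerLimit x ThreadsPerChild = 50 concurrent request workers;
# further connections queue in the listen backlog and "wait their turn"
<IfModule mpm_event_module>
    ServerLimit          2
    ThreadsPerChild     25
    MaxRequestWorkers   50
</IfModule>

An equivalent cap is possible with nginx's limit_conn directive, though that rejects the excess connections instead of queueing them.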
Joined: 5 Aug 04 Posts: 1118 Credit: 17,163,134 RAC: 2,081

Throttling appears to be too strict:

But I can now ping the URL, which is an improvement over yesterday.

$ ping -c 5 upload11.cpdn.org
PING upload11.cpdn.org (192.171.169.187) 56(84) bytes of data.
64 bytes from 192.171.169.187 (192.171.169.187): icmp_seq=1 ttl=47 time=79.3 ms
64 bytes from 192.171.169.187 (192.171.169.187): icmp_seq=2 ttl=47 time=78.10 ms
64 bytes from 192.171.169.187 (192.171.169.187): icmp_seq=3 ttl=47 time=78.5 ms
64 bytes from 192.171.169.187 (192.171.169.187): icmp_seq=4 ttl=47 time=80.1 ms
64 bytes from 192.171.169.187 (192.171.169.187): icmp_seq=5 ttl=47 time=79.1 ms
--- upload11.cpdn.org ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 13088ms
rtt min/avg/max/mdev = 78.523/79.204/80.135/0.557 ms
Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433

Does each trickle contain really useful data or only the sign "WU is still alive"?

The intermediate files uploaded as the model runs do indeed contain useful model results. Anyone familiar with meteorological data can copy the upload file from the project dir somewhere, unzip it and plot the model results. The data is in GRIB format. If anyone wants more info on displaying model results, I'm happy to help, perhaps in a new thread.
Joined: 1 Jan 07 Posts: 1051 Credit: 36,341,855 RAC: 2,973

While the uploads are trickling in (pun intended), I think that the team should turn their attention to the BOINC server. As things stand at the moment, I suspect that a large number of tasks will start to pass their deadline from 23 January. Some will genuinely have gone AWOL, but others will still be waiting for the bus. We don't need to release a whole lot of resends right away. If the team don't want to meddle with the database in the middle of this (and I wouldn't either), at the very least they should set a large number for

<report_grace_period>x</report_grace_period>

(from https://boinc.berkeley.edu/trac/wiki/ProjectOptions)
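For reference, report_grace_period is a setting in the BOINC server's config.xml, documented on the ProjectOptions page linked above. A minimal sketch, with the one-week value purely as an example:

<config>
  <!-- allow hosts an extra 604800 s (7 days) past the deadline to report
       before their results are treated as missed and resends are generated -->
  <report_grace_period>604800</report_grace_period>
</config>

In a real config.xml this element would sit alongside the project's existing <config> options rather than in a file of its own.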
Joined: 12 Apr 21 Posts: 307 Credit: 14,300,326 RAC: 4,834

... the amount of disk space required is clearly non-trivial.

Non-trivial is the key word here, both for the CPDN infrastructure and for user PC spec requirements when it comes to OIFS. We are, after all, basically trying to run on home PCs what's meant to run on supercomputers with appropriate infrastructure. We're still witnessing the infrastructure struggles on the project side. The amount of data and the speed at which it gets generated appear to be beyond what the project can reliably handle right now.

Looking at the user-base side: the 'users with current credit' stat used to be in the 1100s to 1200s when all we had was Hadley models. Since OIFS, I've been noticing the numbers go down to the 600s to 700s. Granted, the recent numbers may not be that reliable given the infrastructure struggles. If they are, it might be that many users' PCs may not be up to par, especially on RAM, to run these models, and we've only gotten the low-resolution models that are barely scientifically useful, per Glenn. Just wait for the multi-core, 22GB-RAM ones. If I'm imagining this correctly, you'd need at least an 8-thread, 64GB-RAM PC just to run 2 concurrently. Definitely non-trivial.

The server's behavior on Wednesday _might_ be an indication that some of the problems which led to the Great Holiday Outage have not actually been resolved.

We've even got a name for it now! The Great Holiday Outage, very appropriate, I like it. :-D

Thank you for your understanding in what's been a difficult few days.

It'd seem to me that should read weeks, not days, to properly acknowledge the situation. The waiting time was non-trivial, to borrow a word from the above paragraph.
Joined: 15 May 09 Posts: 4504 Credit: 18,450,004 RAC: 1,042

Slow going. 800MB transferred and now I need to "Retry Now" again.
Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433

While the uploads are trickling in (pun intended), I think that the team should turn their attention to the BOINC server.

I'll check with Andy what's set.

Our biggest headache at the moment is working around the 'feature' in the client that allows multiple OIFS tasks, 5 GB each, to start up on a machine with 8 GB. It seems to only check the memory limit when a task is running, which is too late, and not when initiating the task. I guess it was never designed for large-memory tasks. I'm seeing a very large failure rate due to this. We've had this conversation before, Richard, you might recall, and we can do things at the server side, but I still think it's an issue in the client that it allows multiple tasks without checking the sum of their memory bounds. Maybe something to bring up on the BOINC client GitHub? Or at the next BOINC public meeting?
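Until the client gets smarter about this, one volunteer-side stopgap is an app_config.xml in the CPDN project directory capping how many of these tasks run at once. A minimal sketch - the app name below is a guess and should be replaced with the real one from client_state.xml or the task properties:

<app_config>
  <app>
    <name>oifs_43r3_ps</name>        <!-- hypothetical: use the actual OpenIFS app name -->
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>

The client reads it at startup or when config files are re-read. It only protects the volunteer who sets it up, of course; it doesn't fix the client-side issue described above.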
Joined: 12 Apr 21 Posts: 307 Credit: 14,300,326 RAC: 4,834

While the uploads are trickling in (pun intended), I think that the team should turn their attention to the BOINC server.

YES, PLEASE. I vote for trying to foresee potential issues going forward and taking preventive steps.

On the positive side, I seem to be one of the presumably few(er) who are able to latch on and trickle uploads in right now. Hopefully the latch is strong enough to finally drain it all in this one shot.
Joined: 1 Jan 07 Posts: 1051 Credit: 36,341,855 RAC: 2,973

We've had this conversation before, Richard, you might recall, and we can do things at the server side, but I still think it's an issue in the client that it allows multiple tasks without checking the sum of their memory bounds. Maybe something to bring up on the BOINC client GitHub? Or at the next BOINC public meeting?

I'm popping out for an hour, to get some exercise and brain food (newspaper) while there's a break in the weather. I'll try to read David's spaghetti code again when I get back, and work out what is, and more importantly isn't, checked when a task is considered for running. It might be our old friend the staggered start: IFS doesn't actually occupy its full quota of disk and (I presume) memory space for several minutes after it's started. If BOINC only considers current free memory when deciding whether to start IFS task #2, then we're up s**t creek.

That is certainly a candidate for GitHub, and I'm happy to write up whatever we conclude is a worthwhile integrated package at the conclusion of the debate. I'm less sure about raising it at a public meeting: perhaps the best approach would be to get it raised on GitHub first, and then get the Prof or whoever to back it up in public at the Workshop at the beginning of March. That gives us a deadline of the end of February to work out the detail of what's going on and propose a solution.
Joined: 26 Oct 11 Posts: 15 Credit: 3,275,889 RAC: 0

Hi,

The current limit is 50 concurrent connections.

Cheers,
David
Joined: 12 Apr 21 Posts: 307 Credit: 14,300,326 RAC: 4,834

Our biggest headache at the moment is working around the 'feature' in the client that allows multiple OIFS tasks, 5 GB each, to start up on a machine with 8 GB. It seems to only check the memory limit when a task is running, which is too late, and not when initiating the task. I guess it was never designed for large-memory tasks. I'm seeing a very large failure rate due to this.

This is nothing new to me, and I've always assumed it was up to the user to manage. I do see the problems with it, though. I remember slowing my 12C/24T, 64GB-RAM PC with a good SSD to a snail's crawl trying to run too many LHC ATLAS tasks, which also require a lot of RAM per task. One mitigation I can think of is to increase the max # of error/total/success tasks to something like 10/10/1 or higher, and just keep tasks circulating until they get a successful completion. Einstein has all those values at 20.

It might be our old friend the staggered start

BOINC can't do that, can it? I don't think I've ever seen or heard of it, other than people writing their own scripts to do it.
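For context, those error/total/success numbers are per-workunit limits fixed when a batch is generated on the server. Assuming the batches are created with the stock BOINC create_work tool (the project may well use its own submission scripts), the invocation would look roughly like this, with app, template and input names hypothetical and the limits purely illustrative:

# hypothetical app, template and input names; retry limits as suggested above
create_work --appname oifs_43r3_ps \
            --wu_template templates/oifs_wu.xml \
            --result_template templates/oifs_result.xml \
            --max_error_results 10 \
            --max_total_results 10 \
            --max_success_results 1 \
            oifs_input_file

Raising max_error_results lets a workunit keep circulating after client-side failures (such as OOM kills) instead of the whole workunit erroring out.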
Joined: 5 Aug 04 Posts: 178 Credit: 17,308,699 RAC: 19,069

It might be our old friend the staggered start

Not sure if I understand this correctly, but you know that you can put BOINC to sleep after start in cc_config.xml:

<start_delay>90.000000</start_delay>

This pauses BOINC for 90 seconds from start, before it begins to start the different WUs. It was introduced to make Windows system starts smoother.

Supporting BOINC, a great concept!
Joined: 1 Jan 07 Posts: 1051 Credit: 36,341,855 RAC: 2,973

It might be our old friend the staggered start

No, it's not something we have available at the moment, but it's something which has arisen in conversations between myself and Glenn since the IFS rollout started. I'll have to look back in the record to find whether it was on this public message board or in private messages.

Not sure if I understand this correctly, but you know that you can put BOINC to sleep after start in cc_config.xml:

That doesn't help, I'm afraid. There are three separate places where I would like to see an available delay:

1) After boot and logon, but before the BOINC client starts
2) After the BOINC client starts, but before any science app starts
3) On multicore machines, after science app n starts, before science app n+1 is launched

<start_delay> implements number (2) from that list, but neither number (1) nor (3). Number (1) is sometimes needed on 'delayed start' machines: Linux, where GPU drivers have to load and initialise before the BOINC client queries their capabilities, and Windows 11, for just about everything. Number (3) is the one I'm thinking about pitching as a new enhancement, as a result of conversations with Glenn.
Joined: 5 Aug 04 Posts: 178 Credit: 17,308,699 RAC: 19,069

1) After boot and logon, but before the BOINC client starts

I think this could easily be done by starting BOINC not at system start, but via the equivalent of a DOS batch file with a timeout statement, followed by the BOINC start as the next step. On my Windows machines (Win7 and Win10) I have used a different method: the Task Scheduler with an "after user login" trigger and an X-minute delay.

Supporting BOINC, a great concept!
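For anyone wanting to copy either variant, a minimal sketch; the paths and delays are assumptions and will differ per install. On Windows, a batch file launched at logon (instead of the normal BOINC autostart entry):

@echo off
rem wait five minutes, then launch the BOINC Manager, which starts the client if it isn't already running
timeout /t 300 /nobreak
start "" "C:\Program Files\BOINC\boincmgr.exe"

On a Linux machine running the client as a systemd service (commonly boinc-client.service), a drop-in override gives delay (1):

# /etc/systemd/system/boinc-client.service.d/delay.conf
[Service]
ExecStartPre=/bin/sleep 120

followed by systemctl daemon-reload. Neither of these helps with delay (3), the per-task stagger discussed above.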
Joined: 29 Oct 17 Posts: 1030 Credit: 16,107,573 RAC: 15,433

A staggered start for OpenIFS would not help. OpenIFS hits its peak memory every timestep; the tasks will not stay the same 'distance' apart and would still crash if memory is over-provisioned. This is a current feature of the BOINC client that needs fixing. It's not a criticism of the code; PC hardware & projects have moved on from early BOINC apps and the code needs to adapt. The only sane way to do this is for the client to sum up what the tasks have told it their memory requirements are, and not launch more than the machine has space for. OpenIFS needs the client to be smarter in its use of the volunteer machine's memory.

And I don't agree this is for the user to manage. I don't want to have to manage the ruddy thing myself; it should be the client looking after it for me. I think all we can do at present is provide a 'Project preferences' set of options on the CPDN volunteer page and set suitable defaults for the number of workunits per host, defaulting them to low, with clear warnings about running too many OpenIFS tasks at once.

That is certainly a candidate for GitHub, and I'm happy to write up whatever we conclude is a worthwhile integrated package at the conclusion of the debate. I'm less sure about raising it at a public meeting: perhaps the best approach would be to get it raised on GitHub first, and then get the Prof or whoever to back it up in public at the Workshop at the beginning of March. That gives us a deadline of the end of February to work out the detail of what's going on and propose a solution.

Definitely a good idea. I can do some more testing and discuss with CPDN. Perhaps we can take this offline.