What happens if I run out of disk space?
Author | Message |
---|---|
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
I run the BOINC client in a partition all its own: 16 Gigabytes, approximately. The BOINC client typically allows three ClimatePrediction applications to run at a time. They tend to run at high priority because the completion time, from the very start, is often far longer than the calendar time. I am not worried about this because I know the results are accepted even if they are late. But my BOINC manager tells me all kinds of things. The applications I am getting on my main machine use a LOT of disk space. If I let the ClimatePrediction applications complete, the available disk space is about 90%, but an individual application can take up to 10 GBytes or so. This is amazing. If all three applications did that at the same time, there would be no more space. Would one or more applications crash, or does the BOINC client arrange not to schedule these until space is available? Or what? I am running Red Hat Enterprise Linux 5 on a dual hyperthreaded Xeon (4 logical processors). |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
If you look at your list of models, the regional models are failing. There are 2 sticky posts at the top of the Linux section which may apply: Here & Here As for the amount of disk space used, it'll be because crashed models don't clear up after themselves. You need to look at the names of the models currently in the Tasks tab, and compare these with the folder names that are under ...\projects\climateprediction.net. Then manually delete everything that's NOT current. Used space should be about a gig per model, plus some space for the programs. As to what happens when the disk fills up: Everything will crash from then on. Backups: Here |
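The comparison described above (task names in the Tasks tab against folder names under the project directory) can be sketched in Python. This is only an illustration: the function name `find_stale_dirs` is made up, the list of current task names has to be copied by hand from the BOINC manager, and it assumes each task name begins with its folder name (with a suffix such as `_0` appended). The `txf` and `gfx` folders are skipped because they hold shared files, not models. Nothing is deleted; the function only reports candidates.

```python
from pathlib import Path


def find_stale_dirs(project_dir, current_tasks, keep=("txf", "gfx")):
    """Return names of subdirectories that match no current task.

    Assumption: a task name starts with the name of its model folder,
    so the comparison is by prefix rather than by equality.
    """
    stale = []
    for entry in sorted(Path(project_dir).iterdir()):
        # Ignore plain files and the shared (non-model) folders.
        if not entry.is_dir() or entry.name in keep:
            continue
        if not any(task.startswith(entry.name) for task in current_tasks):
            stale.append(entry.name)
    return stale
```

Review the returned list by eye before removing anything, since a folder belonging to a task that is still running holds all of the work done so far.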
Send message Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
You should check the stderr_um.txt file for any error messages. Also, this thread may be relevant: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=6901 |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
Both my machines are 32-bit as far as computing is concerned. The processors on the big one are PAE, so I can use my 8 GBytes RAM, but no process can see over 32 bits' worth of addresses. I thought I had turned off the regional models a long time ago, so I should not be getting any. Looks like they are off. As far as disk space usage, it seems to be like this:
trillian:boinc[~/BOINC/projects/climateprediction.net]$ du . | sort -nr
10853696 .
4587540 ./hadcm3n_y8a0_1980_40_007618581
4336408 ./hadcm3n_y8a0_1980_40_007618581/dataout
3962520 ./hadcm3n_ye8q_1940_40_007615112
3711388 ./hadcm3n_ye8q_1940_40_007615112/dataout
1957320 ./hadcm3n_u028_1980_40_007693499
1706176 ./hadcm3n_u028_1980_40_007693499/dataout
250664 ./hadcm3n_ye8q_1940_40_007615112/datain
250664 ./hadcm3n_y8a0_1980_40_007618581/datain
250664 ./hadcm3n_u028_1980_40_007693499/datain
146164 ./hadcm3n_ye8q_1940_40_007615112/datain/masks
146164 ./hadcm3n_y8a0_1980_40_007618581/datain/masks
146164 ./hadcm3n_u028_1980_40_007693499/datain/masks
71228 ./hadcm3n_ye8q_1940_40_007615112/datain/dumps
71228 ./hadcm3n_y8a0_1980_40_007618581/datain/dumps
71228 ./hadcm3n_u028_1980_40_007693499/datain/dumps
33224 ./hadcm3n_ye8q_1940_40_007615112/datain/ancil
33224 ./hadcm3n_y8a0_1980_40_007618581/datain/ancil
33224 ./hadcm3n_u028_1980_40_007693499/datain/ancil
2124 ./txf
2096 ./gfx
620 ./hadcm3n_ye8q_1940_40_007615112/datain/ancil/ctldata
620 ./hadcm3n_y8a0_1980_40_007618581/datain/ancil/ctldata
620 ./hadcm3n_u028_1980_40_007693499/datain/ancil/ctldata
532 ./hadcm3n_ye8q_1940_40_007615112/datain/ancil/ctldata/STASHmaster
532 ./hadcm3n_y8a0_1980_40_007618581/datain/ancil/ctldata/STASHmaster
532 ./hadcm3n_u028_1980_40_007693499/datain/ancil/ctldata/STASHmaster
348 ./hadcm3n_u028_1980_40_007693499/jobs
340 ./hadcm3n_ye8q_1940_40_007615112/jobs
340 ./hadcm3n_y8a0_1980_40_007618581/jobs
84 ./hadcm3n_ye8q_1940_40_007615112/datain/ancil/ctldata/stasets
84 ./hadcm3n_y8a0_1980_40_007618581/datain/ancil/ctldata/stasets
84 ./hadcm3n_u028_1980_40_007693499/datain/ancil/ctldata/stasets
48 ./hadcm3n_u028_1980_40_007693499/tmp
44 ./hadcm3n_ye8q_1940_40_007615112/tmp
44 ./hadcm3n_y8a0_1980_40_007618581/tmp
16 ./txf/CVS
And these three are the programs currently executing. Looks like the dataout files are the problem. |
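To put a number on how much of each model folder is taken up by dataout, the `du` listing can be summarised with a few lines of Python. This is a rough sketch, not a project tool: the function name `dataout_share` is invented here, and it assumes the listing was produced by `du .` run inside the project directory with GNU du's default 1 KiB units.

```python
def dataout_share(du_text):
    """Map each top-level folder to (total_kib, dataout_kib).

    Expects lines of the form "<size> <path>" as printed by `du .`.
    """
    sizes = {}
    for line in du_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip prompts, blank lines and other noise
        kib, path = int(parts[0]), parts[1].strip()
        pieces = path.split("/")
        if len(pieces) == 2 and pieces[0] == ".":
            # "./model" is the total for that folder.
            sizes.setdefault(pieces[1], [0, 0])[0] = kib
        elif len(pieces) == 3 and pieces[0] == "." and pieces[2] == "dataout":
            # "./model/dataout" is the model's accumulated output.
            sizes.setdefault(pieces[1], [0, 0])[1] = kib
    return {name: tuple(pair) for name, pair in sizes.items()}
```

Fed the listing above, the entry for hadcm3n_y8a0_1980_40_007618581 comes out as 4587540 KiB in total, of which 4336408 KiB is dataout, so well over 90% of that folder is model output.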
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
"As for the amount of disk space used, it'll be because crashed models don't clear up after themselves. You need to look at the names of the models currently in the Tasks tab, and compare these with the folder names that are under ...\projects\climateprediction.net." That did not seem to be the case. I terminated all three models that were running, deleted everything in projects/climateprediction.net, and started over. I did that because it was just about to overflow. It downloaded three new models that are now running, and they seem to be taking up 2 gigabytes each in directories such as ~/BOINC/projects/climateprediction.net/hadcm3n_ydi1_1980_40_007832936/dataout, and they are only about 20% complete. The dataout directories are full of large files. "You should check the stderr_um.txt file for any error messages." The stderr_um.txt files are all 0 bytes. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
hadcm3 models can build up files to a bit over 1 Gig each. The data gets zipped and uploaded to the project servers every 25% of the way through. Don't touch the files in the dataout directory! That's the result of all of the crunching so far! Backups: Here |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
"hadcm3 models can build up files to a bit over 1 Gig each. The data gets zipped and uploaded to the project servers every 25% of the way through." Well, I already deleted it. The boinc system was just about to run out of disk space. Saying that the files build up to a bit over a gig each is a big understatement. On my machine, I have a partition that is about 16 gigabytes exclusively for boinc projects, and most tasks take small amounts of space. World Community Grid is the second biggest user and all its tasks together take less than one gig. Climate Prediction is the largest. Typically there are three c.p. tasks running (I have 4 processors), and each one is at about 25% completion and they take about 5.17 GBytes total already. They were taking about 4 Gigabytes each when the system almost ran out of space and I cancelled them out. This is how it is at the moment.
5794316 ./projects
5396096 ./projects/climateprediction.net
1924300 ./projects/climateprediction.net/hadcm3n_ydi1_1980_40_007832936
1863836 ./projects/climateprediction.net/hadcm3n_o34p_1980_40_007833299
1673168 ./projects/climateprediction.net/hadcm3n_ydi1_1980_40_007832936/dataout
1612704 ./projects/climateprediction.net/hadcm3n_o34p_1980_40_007833299/dataout
1444612 ./projects/climateprediction.net/hadcm3n_yiry_1980_40_007833065
1193480 ./projects/climateprediction.net/hadcm3n_yiry_1980_40_007833065/dataout
Is it typical that users run Climate Prediction in even larger partitions? How big a partition is really required? Your estimate of about one gig per task seems far too small. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
My BOINC partitions are 10 Gigs. This includes both this main site and our beta test site, and I have many versions of programs on that partition that have been tested over the years. Currently:
machine 1: Total in use: 2.64 Gigs
the models' folders:
836 Megs for a hadcm3n model
499 Megs for a hadam3p model
and 633 Megs for the beta folders
The rest is common files
machine 2: Total in use: 3.0 Gigs
793 Megs for one hadcm3n model
782 Megs for a 2nd hadcm3n model
and 693 Megs for the beta folders
The rest is common files
----------------
I uploaded the first lot of zip files a few days ago, which is why each model's folder size is under a gig, but before that one of them was at about 1.3 Gigs. Unless the Linux version is a lot different, your setup is STRANGE. 25% is the point at which the files are zipped up and sent back to the project, after which the folder size should drop. Backups: Here |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
"I uploaded the first lot of zip files a few days ago, which is why each model's folder size is under a gig, but before that one of them was at about 1.3 Gigs." When I was watching the sizes of the various boinc processes, I did notice that c.p. did drop from time to time and ranged, IIRC, from about 4 GBytes to 12 GBytes for c.p. alone. But the last three tasks were all growing steadily together, and when things were getting close to totally full in that partition, I started it over. So it seems to be running "normally" other than taking what seems to be an unreasonable amount of disk space. What do you suppose is STRANGE about my setup? I have a 16 GByte partition for boinc. I do not suppose that is strange. What is in there is what c.p. put in there. I emptied out the entire directory structure for Climate Prediction about a month ago because someone here said there was probably a lot of leftover stuff from jobs that terminated strangely. I found none, but deleted everything to be sure, and started over. The error files for c.p. are 0 bytes, so c.p. is not aware of anything unusual. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I'll have a talk with the other moderators about your large file sizes, and see if anyone knows anything. Backups: Here |
Send message Joined: 15 May 09 Posts: 4532 Credit: 18,787,672 RAC: 19,721 |
On my Linux system, the current total space used by the project directory is 2GB; that is with 3 HADAM3P tasks: 1 running, 1 waiting to start and 1 suspended to allow a HADAM3CN task to run. I have seen usage go up as high as 4.6GB when running only HADAM3CN tasks, but I think that may have been exacerbated by one of the servers being down at the time. I haven't seen it go up that high for a while. Dave |
Send message Joined: 15 May 09 Posts: 4532 Credit: 18,787,672 RAC: 19,721 |
Or it may have been when I had some crashed task files around. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
"I'll have a talk with the other moderators about your large file sizes, and see if anyone knows anything." Thank you. I would hate to have to quit c.p. because of this issue. It is the most important BOINC project I run. Maybe if I could trick it into downloading only one task at a time, or perhaps two, I would not run out of space. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
Well, in about two days, mine is now this:
6077656 ./projects
5674288 ./projects/climateprediction.net
2046624 ./projects/climateprediction.net/hadcm3n_ydi1_1980_40_007832936
1967412 ./projects/climateprediction.net/hadcm3n_o34p_1980_40_007833299
1795492 ./projects/climateprediction.net/hadcm3n_ydi1_1980_40_007832936/dataout
1716280 ./projects/climateprediction.net/hadcm3n_o34p_1980_40_007833299/dataout
1496904 ./projects/climateprediction.net/hadcm3n_yiry_1980_40_007833065
1245772 ./projects/climateprediction.net/hadcm3n_yiry_1980_40_007833065/dataout
These are all hadcm3n tasks and all are running. Two are "high priority". Actually, they all should be because they are close to not completing on time. They are due June 18 and have over 1300 hours each to complete. I have no crashed files around. And it looks like no errors.
ls -l hadcm3n_o34p_1980_40_007833299/stderr_um.txt hadcm3n_ydi1_1980_40_007832936/stderr_um.txt hadcm3n_yiry_1980_40_007833065/stderr_um.txt hadcm3n_yiry_1980_40_007833065/dataout/stderr_um.txt
-rw-r--r-- 1 boinc boinc 0 Mar 19 00:35 hadcm3n_o34p_1980_40_007833299/stderr_um.txt
-rw-r--r-- 1 boinc boinc 0 Mar 18 23:35 hadcm3n_ydi1_1980_40_007832936/stderr_um.txt
-rw-r--r-- 1 boinc boinc 0 Mar 19 05:03 hadcm3n_yiry_1980_40_007833065/dataout/stderr_um.txt
-rw-r--r-- 1 boinc boinc 0 Mar 19 05:03 hadcm3n_yiry_1980_40_007833065/stderr_um.txt |
Send message Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
Jean-David, I'm not sure if this is impacting you, but have you seen this sticky about issues running RHEL 5 and its derivatives? Edit: I see Les already linked to it... never mind. |
Send message Joined: 5 Aug 04 Posts: 1120 Credit: 17,199,602 RAC: 2,377 |
Yes, I saw that. The problem described, "It appears that the hadam3p regional models (EU, SAF, PNW) don't play well with RHEL/CentOS/Scientific Linux 5 and crash immediately." is not occurring because I told the server not to send me those. The actual problem with those is that one of the system libraries is too old for the builds, so they crash as soon as they call one of those libraries. Red Hat support each release for 7 years (recently increased to 10 years) so people need not upgrade unless they want new features. They backport all bug fixes and security fixes. But many BOINC applications assume you are running the latest and greatest (or nearly so) and do not allow for long-lived stable distributions. But as I said, I do not accept jobs of the hadam3p variety.
Run only the selected applications
UK Met Office HadSM3 Slab Model: yes
UK Met Office HadCM3L Coupled Model: yes
UK Met Office HadAM3: yes
UK Met Office HadSM3 Mid-Holocene: yes
UK Met Office HadAM3P: no
UK Met Office FAMOUS: yes
UK Met Office HadAM3P European Region: no
UK Met Office HadAM3P Southern Africa: no
UK Met Office HadAM3P Pacific North West: no
UK Met Office HadCM3 Coupled Model Full Resolution Ocean: yes |
Send message Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
When the CPDN developers set up the hadam3p's they used an older distribution, but in trying to target the largest set of users they chose the most popular distribution, Ubuntu, which still had a more recent kernel and libraries than the oldest supported Red Hat. Hopefully in the future they'll use an older Scientific Linux (a free distribution which releases synchronously with Red Hat Enterprise) as the development system. Actually, there were some reported problems with the RHEL 5 libstdc++6 and compression with hadcm3n. Could the source of your disk space troubles be related to bad compression? If you have administrative privileges on your machine you can get an RPM from Red Hat which will install a later version of libstdc++6. If not, you can try the link mentioned in the sticky, which points to an ingenious workaround from an unusually dedicated member. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The question under consideration in this thread is: Why are the hadcm3n models on this Linux system taking up approx 3 times the space that they do on my Windows system? There's been no reply to my question on the other board, and the only 2 things that I can think of are that the zips aren't being created, or that they're being created and then not uploaded. Jean, are there message lines, either in Messages or in the stdoutdae.txt file, which say something like: Started upload of hadcm3_a009_1859_10_000258667_0_1.zip? And while I'm posting:
1) There's NO deadline in this project. The one that people keep quoting is just an artificial one that's a requirement of the BOINC system.
2) It's been posted many times that the hadcm3 models compete heavily for the FPU, and slow each other down A LOT. Only having half the number of these as there are processor cores is a good rule of thumb. Otherwise, they go into high priority running a lot of the time.
3) The reason that the project moved on to newer versions of library files is that the University department for which our programmers work has moved on to newer compilers, which no longer support the older libraries.
Backups: Here |
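One way to answer the upload question without reading the whole log by eye is a small Python scan of stdoutdae.txt. Treat this as a sketch under assumptions: the function name `pending_uploads` is invented, the "Started upload of" wording is taken from the example quoted above, and the matching "Finished upload of" wording is an assumption about what the client logs when a transfer completes.

```python
import re

# Assumed message wording; adjust if the client's log lines differ.
UPLOAD_RE = re.compile(r"(Started|Finished) upload of (\S+\.zip)")


def pending_uploads(log_lines):
    """Return zip names whose upload started but never finished."""
    started, finished = set(), set()
    for line in log_lines:
        match = UPLOAD_RE.search(line)
        if not match:
            continue
        if match.group(1) == "Started":
            started.add(match.group(2))
        else:
            finished.add(match.group(2))
    return sorted(started - finished)
```

Pass it the lines of stdoutdae.txt from the BOINC data directory. An empty result when no upload lines exist at all would point to the zips not being created, whereas a non-empty result would point to zips that were created but are stuck in transfer.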
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
This is a real puzzler. With various recent Ubuntu versions my hadcm3n subdirs in the BOINC/projects/climateprediction.net folder are about 1.2-1.7 Gig, depending on how far the models have progressed and on the uploads. Using an ext3 or ext4 filesystem. Is it possibly something to do with the particular filesystem or filesystem parameters? Is this an ext{2,3,4} filesystem or something else, like xfs or zfs? It's remotely possible that if the filesystem was created with very large blocks or chunks, there is wasted space with the smaller files. This is just a guess at a remote possibility. |
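The block-size guess can be checked directly by comparing the bytes the files contain with the bytes the filesystem has allocated for them. A minimal Python sketch follows; the function name `allocation_overhead` is invented, and it relies on `st_blocks`, which is available on Linux and other Unix systems and counts 512-byte units.

```python
import os


def allocation_overhead(root):
    """Return (apparent_bytes, allocated_bytes) for all files under root."""
    apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size          # bytes of content
            allocated += st.st_blocks * 512  # bytes reserved on disk
    return apparent, allocated
```

If the allocated figure for a dataout directory is several times the apparent figure, large blocks are wasting space; if the two are close, the files themselves really are that big and the cause lies elsewhere.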
Send message Joined: 19 Apr 08 Posts: 179 Credit: 4,306,992 RAC: 0 |
(Eirik, how's the weather in Northfield?) I have three hadcm3n's now and their directory sizes are:
1391436 projects/climateprediction.net/hadcm3n_o1t6_1980_40_007833447 (96.8% complete)
1193884 projects/climateprediction.net/hadcm3n_2056_1940_40_007858548 (39.7%)
1059632 projects/climateprediction.net/hadcm3n_o3f2_2020_40_007857560 (17.4%)
Ext3 partition running 64-bit Ubuntu 10.04 here. On this four-core machine that I've been running CPDN on for nearly three years, I've never seen my total CPDN disk usage go above 6 Gigs, even when I dedicate all the cores to CPDN. |
©2024 cpdn.org