Posts by Ingleside


41) Message boards : Number crunching : Hyperthreading (Message 46244)
Posted 16 May 2013 by Ingleside
Post:
Haven't benchmarked recently, but I did benchmark my i7-920 when it was new. It's running at stock speed, with 6 GB of memory in triple-channel mode, probably at 1066 MHz memory speed. If I don't misremember, I benchmarked on a Hadam3P model, since AFAIK it was before the various regional models were released. In any case, the results are:

1 instance to 1st. trickle: 6237 seconds/trickle.
4 instances to 3rd. trickle and averaging: 8456 seconds/trickle.
8 instances to 3rd. trickle and averaging: 14002 seconds/trickle.

Meaning, running 4 instances it performs like a 3-core computer, while running 8 instances it performs like a 3.5-core computer. The advantage of running 8 instances over 4 was 21%.

The benchmarking was done without any turbo boost (I don't even remember if it has turbo at all...), and the same model was re-run from the beginning to remove any effects of variable speed within a model and between models.
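For reference, the "3-core" and "3.5-core" figures follow directly from the trickle times; a quick sketch of the arithmetic, using the numbers from the benchmark above:

```python
# Relative throughput ("effective cores") from the trickle timings above.
# Throughput for each setup is instances / seconds-per-trickle,
# normalised to the single-instance rate (6237 s/trickle).
def effective_cores(instances, secs_per_trickle, baseline_secs=6237.0):
    return (instances / secs_per_trickle) * baseline_secs

four_up = effective_cores(4, 8456.0)    # ~2.95 effective cores
eight_up = effective_cores(8, 14002.0)  # ~3.56 effective cores

print(f"{four_up:.2f} {eight_up:.2f}")
print(f"HT advantage of 8 over 4 instances: {eight_up / four_up - 1:.0%}")
```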


As for the i7-3770K not having the same HT effect, my first suggestion would be turbo boost, but if you've disabled this... Another possibility is that the i7-920 is triple-channel, so even at 1066 MHz it has the same total bandwidth as the i7-3770K running default dual-channel 1600 MHz memory. Since that CPU is also faster, it's possible the memory bandwidth saturates earlier than on the slower i7-920.
42) Message boards : Number crunching : Several jobs uploads in project backoff (Message 46153)
Posted 3 May 2013 by Ingleside
Post:
You have exploded an urban myth!

Well, some myths are easy to bust...


43) Message boards : Number crunching : Several jobs uploads in project backoff (Message 46151)
Posted 3 May 2013 by Ingleside
Post:
The time limit for uploading files from any project was extended. I can't remember whether the limit is now two or three months, but in any case it's far longer than we need.

It's 90 days.

But, but, but... each file is still only allowed 100 upload attempts, after which it expires. That's the BOINC rule. 100 is plenty but please don't use up the files' lives by repeatedly pressing the Retry now button in the Transfers tab. The files come to no harm while they wait.

I've never seen any sign of a "100 upload attempts" rule, and seeing how a file could easily reach that limit in 4 days (assuming retries once per hour), it wouldn't make any sense to increase the limit from 14 days to 90 days if such a rule existed.

To do a little test, I blocked the internet connection and hit "Retry" on a SIMAP upload 110 times... no problem. Did a little editing, and, as BoincTasks happily shows, it's now been retried... 1234567 times; hit retry: 1234568 times, 1234569 times, 1234570 times, 1234571 times...

Since 1234567 >> 100, I didn't see any sign of a 100-retry limit on uploads...
44) Questions and Answers : Windows : Intel Visual Fortan run-time error (Message 45916)
Posted 13 Apr 2013 by Ingleside
Post:
For the "killer trickle" to be sent to the correct target, that target, i.e. climate model, needs to return a trickle_up file for the server to find it.
As has been said, this is unlikely to happen, so they CAN'T be killed from the server.

Aborting tasks without relying on trickle messages has been part of BOINC since around BOINC client v5.10.x.
45) Message boards : Number crunching : Download Checksum Error (Message 45823)
Posted 6 Apr 2013 by Ingleside
Post:
Got at least one old WU including white-space as part of the MD5, and v7.0.60 didn't give any MD5 errors, so it looks like the fix works as it should.

While there weren't any MD5 errors, the WU did still error out, but that was due to one of the files missing from the server. Copying the link into a web browser also gave 404 Not Found, so this isn't a client problem.
46) Message boards : Number crunching : Download Checksum Error (Message 45812)
Posted 5 Apr 2013 by Ingleside
Post:
Have only managed to get assigned one model since upgrading to v7.0.60, but at least no download errors this time. The WU is from December and is therefore old.

Since I've done another scheduler request after getting the new model, I can't check whether the scheduler reply included white space as part of the MD5.


47) Message boards : Number crunching : Upload stops at 10MB with HTTP error (Message 43803)
Posted 15 Feb 2012 by Ingleside
Post:
Uploads can stack up indefinitely and not be lost, true?

Not completely indefinitely: if you're running an old BOINC client, the cut-off is 14 days from the first upload attempt of each individual upload, while with v6.10.0 and later clients the cut-off has been extended to 90 days.

48) Message boards : Number crunching : Project file upload handler is missing. (Message 43752)
Posted 5 Feb 2012 by Ingleside
Post:
Or so I thought,
I have just been through a copy of clientstate.xml without finding a misspelling of handler. I am still getting the message Project file upload handler is missing.
I have tracked down the zip file in question in the .xml file but am afraid it makes little sense to me, other than that the file is not transferring which I could tell by looking @ the transfers tab.

I am quite happy for someone to point out something in the portion of the file which is bleeding obvious, even if not to me.

Hmm, is it the computer running v6.13.10?

If so, the first step is to upgrade to v7, since v6.13.10 has some major bugs, among others in the handling of trickle uploads.


If, after the upgrade to v7, this file is still stuck without uploading, try changing the status part from zero to 1, as shown below:

<upload_url>http://uploader1.atm.ox.ac.uk/cpdn_cgi/file_upload_handler</upload_url>
</file>
<file>
<name>hadam3p_eu_8h04_2000_1_007687181_0_4.zip</name>
<nbytes>13743083.000000</nbytes>
<max_nbytes>150000000.000000</max_nbytes>
<md5_cksum>41130e8681805ab6432ca43cea075478</md5_cksum>
<status>1</status>
<upload_url>http://uploader1.atm.ox.ac.uk/cpdn_cgi/file_upload_handler</upload_url>
<persistent_file_xfer>
<num_retries>14</num_retries>
<first_request_time>1327526537.473585</first_request_time>
<next_request_time>1328311850.488157</next_request_time>
<time_so_far>295.883478</time_so_far>
<last_bytes_xferred>0.000000</last_bytes_xferred>
<is_upload>1</is_upload>
</persistent_file_xfer>

Dave
49) Message boards : Number crunching : More FPU or Integer Power needed? (Message 43444)
Posted 20 Nov 2011 by Ingleside
Post:
My recommendation was based on what we have seen on cpdn. A Core i7 920 (hardly a high end processor) can beat a higher priced Phenom II X6 1100T in total throughput of models.

Hmm, in my experience the i7-920 and X6-1090T perform similarly, but the X6-1090T decidedly has a big advantage when it comes to purchase cost. Just for the CPU, the i7-920 was 33% more expensive than the X6-1090T, and other things like the mainboard and memory were also more expensive for the i7-920 than for AMD.

Now, at least around here, no-one has sold the i7-920 for the last year, but let's look at the current entry-level offering from Intel among the i7 CPUs, the i7-2600K. The i7-2600K is 44% more expensive than AMD's X6 1100T. No idea how fast the i7-2600K is, but I would still guess it's less than 40% faster...

A hex-core Intel CPU should be faster, but is also much more expensive. Also, running multiple memory-hungry CPDN models will have an impact on performance, so it's not certain a hex-core gives much higher throughput than a quad-core does.
50) Message boards : Number crunching : Set no gpu option (Message 43177)
Posted 8 Oct 2011 by Ingleside
Post:
I don't think that the server version is sufficiently up to date for the "no gpu" code.

CC_config can apparently be set on user's computers to do the same thing.

Unfortunately, in v6.12.xx and earlier clients, users can't stop individual projects from using GPUs; the available cc_config options disable GPUs for all projects. Seeing that MarkJ runs GPUGRID and Byron runs SETI@home, neither of them would want to disable GPU crunching on their computers.

Disabling GPU crunching for individual projects is an option in v6.13.x, but these are alpha clients with many rough edges, so anyone who isn't an alpha tester is recommended to stay away from them for now. At least from the look of things, GPUGRID doesn't currently work with v6.13.x clients, since it's one of the many projects that haven't disabled upload certificates yet.
51) Message boards : Number crunching : HADAM3P - Maximum elapsed time exceeded (Message 42775)
Posted 12 Aug 2011 by Ingleside
Post:
Several days ago, I downloaded three Hadam3p WUs. The to-completion time was listed as 1548 hours! I know that these to-completion times are often significantly inflated, often 2 or 3 times the actual running time, but this is about 10 times what it will take to run the WUs.

Are other people seeing this wild inflation, or does it have something to do with the tweak that I made to the client_state file to fix a problem with an earlier WU? That WU had a very low to-completion time of only 18 hours. That tweak is described above in this thread. Do I need to undo the tweak?

If you finished one of the broken Hadam3p tasks by editing fpops_bound but didn't also increase fpops_est, your Duration Correction Factor (DCF for short) will have increased accordingly. So if, for example, the initial estimate for the broken model was 10 hours but the model actually took 100 hours to run, your DCF increased by 10x.

This new DCF will influence all future estimates, so a new Hadam3p with a "correct" fpops_est will show 1000 hours instead of 100 hours.

The DCF will slowly decrease again as you finish tasks. If I don't misremember, it decreases by at most 10% for each task, except when the client thinks the difference between the current DCF and the lower value is too large, in which case it only decreases by 1% per task... But in any case, it should slowly decrease again.
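As a rough illustration, here is a simplified model of that behaviour (not the actual BOINC client code; the 10-hour/100-hour figures are just the example from this post):

```python
# DCF inflation from one mis-estimated task, then the slow walk back down.
est_hours, actual_hours = 10.0, 100.0
dcf = actual_hours / est_hours          # DCF jumps from ~1.0 to 10.0

# A new Hadam3p with a correct 100-hour estimate is now shown as:
shown_hours = 100.0 * dcf               # 1000 hours

# Finishing on-time tasks then shrinks the DCF by at most 10% each
# (only 1% when the client judges the gap too large - simplified here):
for _ in range(10):
    dcf = max(1.0, dcf * 0.90)

print(shown_hours, round(dcf, 2))       # ~1000 hours shown, DCF back to ~3.49
```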


The DCF, and therefore the estimates, will never be very good here at CPDN. For one thing, HADCM3N's estimate is too high, so after a string of these the DCF will drop to 0.5 or so, but a single Hadam3p will push the DCF back up to 1 again. Also, for some of the models the speed depends significantly on what else the computer is running; if you run multiple instances of the same model they can slow each other down, so even this will give some variation between runs.


Edit:

You can see your current DCF in BOINC Manager, as long as you're running v6.6.xx or later, by selecting the Projects tab, selecting a project, and hitting "Properties". DCF is the last item listed.

The DCF is also displayed on the web page: if you look at one of your own computers' details, you'll see the DCF.
52) Message boards : Number crunching : HADAM3P - Maximum elapsed time exceeded (Message 42719)
Posted 30 Jul 2011 by Ingleside
Post:
Thanks for the info. Will try it tonight after I get home from work. Wish me luck (will make backup first).

Just to be sure that I understand, I should modify the entry from this:

<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>

To look like this?

<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>9

Is this correct? Please respond.

No, it should be from:
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
to
<rsc_fpops_bound>9796838333333330.000000</rsc_fpops_bound>


53) Message boards : Number crunching : HADAM3P - Maximum elapsed time exceeded (Message 42717)
Posted 30 Jul 2011 by Ingleside
Post:
<rsc_fpops_bound> in client_state.xml

<name>hadam3p_pnw_314p_1995_1_007369937</name>
<app_name>hadam3p_pnw</app_name>
<version_num>609</version_num>
<rsc_fpops_est>79683833333333.000000</rsc_fpops_est>
<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>
<rsc_memory_bound>364000000.000000</rsc_memory_bound>
<rsc_disk_bound>2000000000.000000</rsc_disk_bound>

Is this right? Also please explain which value I have to change and to what.

I don't want to waste 3 and a half days of crunching on a defective WU, only to have it crash, when it is fixable.

You'll need to change <rsc_fpops_bound>.

Just make sure BOINC is stopped, open client_state.xml in Notepad (or something similar under Linux), add an extra digit at the start of the <rsc_fpops_bound> value (before the decimal point), save the new client_state.xml and restart BOINC.

It doesn't matter if you also change <rsc_fpops_bound> for other tasks, so it's possible to search & replace all occurrences of <rsc_fpops_bound> with <rsc_fpops_bound>9 or similar (adding an extra leading 9 to all of them).

To avoid getting a very high Duration Correction Factor, it's also a good idea to change <rsc_fpops_est> by adding an 8 or 9 at the start. This should only be done for the wrongly-estimated task(s).
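That search-and-replace step can be sketched in a few lines of Python (an illustration, not an official tool: stop BOINC first and keep a backup of client_state.xml; the tag name is the real one, but the sample value below is just an example):

```python
# Prepend a 9 to every <rsc_fpops_bound> value, exactly as the
# search & replace described above would do, making each bound ~12x larger.
def bump_fpops_bound(xml_text):
    return xml_text.replace("<rsc_fpops_bound>", "<rsc_fpops_bound>9")

sample = "<rsc_fpops_bound>796838333333330.000000</rsc_fpops_bound>"
print(bump_fpops_bound(sample))
# -> <rsc_fpops_bound>9796838333333330.000000</rsc_fpops_bound>
```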
54) Message boards : Number crunching : Needs more disk space (Message 42710)
Posted 29 Jul 2011 by Ingleside
Post:
Thanks for those. The information isn't readily apparent on the Climateprediction site, although it's much more evident on some of the others.

I looked in Computing Preferences, and it's set to "not more than 50GB, but leave at least 3GB free". The drive is nowhere near full, so I don't understand what the "Notice from Server" is all about. Unfortunately, it's not possible to "copy and paste" stuff from BOINC so I can't repeat the message here.

If you've changed preferences locally in your BOINC client, these will override your web-based preferences. Even just opening the preferences in the client and clicking "OK" without changing anything will mean the web-based preferences are no longer used. If this is your problem, then to start using the web-based preferences again, open the local client preferences and hit "Clear".

55) Message boards : Number crunching : Server can't open log file (Message 42662)
Posted 24 Jul 2011 by Ingleside
Post:
But BOINC clients cache the IP forever (at least older ones do) and will continue giving you the "connect" error message even if the server is already up and running, just with a fresh IP.

Don't remember the exact version, but this was fixed around v6.10.30, so any recent BOINC client shouldn't normally have any problem getting the new IP.



56) Message boards : Number crunching : Server can't open log file (Message 42631)
Posted 17 Jul 2011 by Ingleside
Post:
I'm succumbing to temptation - not expecting any comments- but why oh why don't organisations investing in huge hardware/database installations check back in history and use the only genuinely 365/24/7 system that has been proven across the world. Namely OpenVMS/Rdb. The last system I worked on had zero downtime in 10 years. (excluding 1 night out a year for new releases.) Yeah , I know - it's a legacy system. Oh well.

Well, it's the first time I've heard of an OS that keeps running flawlessly when the hardware it's running on has stopped working...
57) Message boards : Number crunching : NO WORK! (Message 42220)
Posted 19 May 2011 by Ingleside
Post:
Phantom models occur from time to time on this project and on others when the server load is high and the download process fails to complete properly. A result assignment record then appears but no corresponding entry is available in the BOINC Manager. As far as I know, it isn't possible to resend that work - at any rate, CPDN doesn't do it.

Re-issuing of "lost" work was added as an option in late 2005, so it has been used by other BOINC projects for over 5 years now. A project enables it by including <resend_lost_results> in its normal project config file. The only other requirement is a minimum client version, but that minimum is v4.45, meaning 99.99% of active BOINC users meet it (and all CPDN users, since CPDN doesn't allow clients that old).

When enabled, re-issuing will send "lost" work to the same computer the next time it asks for work (*), but within the limits of how much work it asked for, and any other limits like the maximum number of tasks to issue in one go; and of course it doesn't re-issue work that can't fit within the deadline or isn't needed any longer. So even if CPDN enables this option now, no-one will get 100+ old "lost" models sent in one go. ;)

The only downside of enabling re-issuing of "lost" work is that it adds extra load on the database server, so it's possible this would be too high a load for CPDN to handle...


(*): WCG runs, by now, very old server code, so from that project you'll get re-issues even if you don't ask for work, and you'll get all "lost" work at once. This won't be an issue for CPDN if it enables <resend_lost_results>, since CPDN runs more recent code.
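For reference, the server-side switch described above is a single element in the project's config.xml (a sketch of the usual BOINC layout; surrounding elements are omitted and placement inside <config> may vary by server version):

```xml
<boinc>
  <config>
    <!-- Re-issue "lost" results to the same host on its next work request -->
    <resend_lost_results/>
  </config>
</boinc>
```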
58) Message boards : Number crunching : Upload problems (Message 42019)
Posted 24 Apr 2011 by Ingleside
Post:
From what I can gather, the problem seems particularly severe in PNW tasks for Linux. Would it be possible to disable the distribution of these work units to Linux clients until resolved?

This could fairly easily be done by making a customized plan class, but with the current less-than-optimal staffing situation I wouldn't expect it to happen at this point.

I have tried editing the client_state.xml file to no avail. Does the client make any contact with the URL http://boinc1.coas.oregonstate.edu/cpdn_cgi_main/ at any stage prior to upload? Could it be an error emanating from them?

The upload server is only accessed when the client tries to upload a file, not before. So, with the corruption being present at the time the task was first downloaded, it has nothing to do with connections to the upload server.

As for editing client_state.xml, make sure you've completely exited the BOINC client before trying to edit the file; this includes stopping any BOINC service, or whatever it's called under Linux. Otherwise the info in client_state.xml will just be overwritten with new info as BOINC runs.

If you have exited BOINC, edited client_state.xml afterwards, and the wrong URL somehow still gets re-created on the next start of BOINC, that would be very interesting, since it's much easier to test things that happen on just a restart of the client than something that only happens on the first download of a task...
59) Message boards : Number crunching : Upload problems (Message 41978)
Posted 12 Apr 2011 by Ingleside
Post:
Tried again with this z25h
This time stopped network before first file downloaded, got exact same results in both sched_reply and client.state -- 12 instances of 'hnndler' in client.state and no obvious errors in sched_reply.

OK, so the problem is clearly on the client side of things, and not on the server side or during communication to the client. That's at least a starting point for tracking down the problem...

Looking at my own PNW tasks, it seems PNW is the only CPDN task type that uses http://boinc1.coas.oregonstate.edu/ as its upload server, but why that should have any effect seems strange.

The only other difference that seems to be present is that the upload handler is at /cpdn_cgi_main/, and the other CPDN URLs are both shorter and have only one _ or -, so maybe this has an effect even though it really shouldn't...

When it comes to size, /cpdn_cgi_main/ is 13 letters in total, while some BOINC projects use longer paths. For example, SIMAP uses /boincsimap_cgi/ at 14 letters, while Einstein@home, for the non-Arecibo tasks, uses /EinsteinAtHome_cgi/, meaning 18 letters. SIMAP also has a longer total URL length than the PNW models, in case that has any meaning.


So, while I know this is probably not the reason for the corruption, could you also try attaching to SIMAP and Einstein@home, and see if you get URL corruption from those projects as well?
60) Message boards : Number crunching : Upload problems (Message 41966)
Posted 10 Apr 2011 by Ingleside
Post:
What might be helpful, though rather onerous, is if some Linux user were to:

(a) download a PNW and suspend it before it starts (the download will still complete)

(b) make a backup of client_state.xml

(c) run the model

Repeat ad nauseam until there's an upload failure, then compare the current file with the backup. This would at least confirm what we suppose - that the corruption happens at the client end. (Or it might show the corruption happens before the model starts.)

I'll recommend one additional step:

a0: Immediately after being assigned a PNW task, suspend network activity and make a backup of sched_reply_climateprediction.net.xml before enabling the network again.

It's important that the client doesn't contact the CPDN scheduling server again before you make the backup.


If there's now a misspelling in sched_reply*, it's either a server-side problem, a problem during transfer from the scheduling server, or a problem introduced by the client when handling the scheduler reply.

If sched_reply* had everything spelled correctly, but the spelling error shows up in client_state.xml, it's a client problem.
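The two checks above boil down to searching each file for the corrupted token; a minimal sketch, using "hnndler" (the misspelling reported earlier in the thread) and inline sample strings instead of the real files:

```python
# Decide where the corruption was introduced, per the procedure above.
def diagnose(sched_reply_text, client_state_text, token="hnndler"):
    if token in sched_reply_text:
        return "server side, transfer, or the client's handling of the reply"
    if token in client_state_text:
        return "client problem"
    return "no corruption found"

# Hypothetical file contents for illustration: reply is clean,
# client_state carries the misspelling.
print(diagnose("...file_upload_handler...", "...file_upload_hnndler..."))
# -> client problem
```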



©2024 climateprediction.net