| Author | Message |
|
|
|
I started this thread because I have been wondering what the success-to-failure ratio is with the new FAMOUS WUs. Please post how many FAMOUS WUs you have run to completion and how many have crashed along the way. It would also be useful to include the OS, the type of processor (Intel or AMD) and the processor speed.
I have just succeeded in finishing 1 FAMOUS model, but 2 others crashed along the way. This makes my success-to-failure ratio 1:2 so far. OS is Windows 7 64-bit and processor is an Intel Core 2 Duo 2.2 GHz.
____________
|
|
|
|
|
|
You should also include the 1st part of the model's name, e.g. r100_599, as some of the 1st parts are known to be more reliable than others (e.g. r109), and the start year affects how erratic the model is. Start-year 599 is a 'spinup', and can be worse than a start year further along.
My only mainsite model, a r185_799, is a little over halfway, with about 4 days to go.
edit
I forgot about this one:
r219_599, Intel P4 @3.2GHz, XP Pro.
Failed with the expected P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED |
|
|
|
|
|
... and don't worry about failures: the purpose of this group of FAMOUS work units is to separate those that lead to stable climates from those that don't. |
|
|
|
|
|
My 1 failure (negative pressure): r150_799.
My 2 completed successfully: r152_1199 and r152_1399.
PC is Intel Core 2 Quad @ 2.8 GHz, Win7 Home Premium. |
|
|
|
|
|
I don’t know if this means much now that the present version of the “FAMOUS” model has been withdrawn, but model
Famous_r125_1399_200_006632634_4 crashed at approx. 30% completion.
Windows7 64 bit running on an Intel Core2Duo T6600 2.2 GHz chip (4 GB of RAM).
____________
|
|
|
|
|
|
2 success r212_599, r182_599
0 failure
2 in progress
Phenom II X4 955, Win7 64 |
|
|
|
|
|
This system has now finished processing 15 models - Intel Q6600 @ 3.2, 4GB RAM, Win 7 Home x64.
2 successes - r112_1399, r193_999
13 failures, with key stderr out message lines. 8 were Theta related.
The popular reason...
r157_799, r168_1599, r168_1599 (a different one), r174_1599, r179_1399
Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED.
Variation?
r119_1399, r156_599, r175_1799
Model crashed: ATM_DYN : INVALID THETA DETECTED.
The remainder were likely caused by reboots, caused by power supply blips one night and flaky PSU in aftermath (sorted by clearing a build-up of static).
Error messages seem to suggest this kind of event.
Anyway, to complete the record for this machine...
r118_1199, r176_799, r185_799, r117_999, r215_599
Am maintaining similar logs for another 2 systems (4 + 15 models) and will post scores when all finished in 3 days time.
|
|
|
|
|
|
1 success
5 failures
|
|
|
|
|
|
One is still running.
6835949
One failure
Three successes
6835757
6835847
6836187
I notice my failed model has one success noted.
____________

Forum search Site search |
|
|
|
|
|
Problem OS-related ? WU 6835571 -> Windows is crashing, Linux running ;
my Model crashed too (Win XP Pro)
____________
|
|
|
mo.v (Forum moderator)
Joined: Sep 29 04 Posts: 2270 Credit: 5,359,391 RAC: 1,327
|
|
If a model becomes unstable on one OS (Windows, Linux or Mac) plus processor type (Intel or AMD) it is likely to develop exactly the same instability at the same moment on other computers with the same OS + processor type combination. There are 5 combinations:
Windows + Intel
Windows + AMD
Linux + Intel
Linux + AMD
Mac + Intel
I've looked through quite a few CPDN FAMOUS WUs to see the situation and have noticed that in a small number of cases one or more computers with Windows + Intel develop an instability but another computer with the same combination processes past that point. In at least one case the other computer developed an instability later. This is rare.
HadSM iceworlds also depend on this OS + processor type combination. But in one case I saw 3 computers with Linux and a particular processor develop an iceworld while a fourth with the same combination completed the model normally. This is also rare.
The processor type matters because each deals with a particular aspect of the arithmetic differently. I think the difference lies in how each rounds off the last digit after the decimal point, i.e. the treatment of rounding errors.
[Edit: I didn't look into the likelihood that computers continuing past an expected instability point were overclocked. Insufficiently tested overclocking could generate processing differences.]
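A toy illustration of the rounding point, not the FAMOUS code: floating-point addition is not associative, so two builds that evaluate the same sum in a different order can differ in the last digit, and in a chaotic system a difference that small grows until the two runs part company. The logistic map below is only a stand-in for a sensitive model; `r = 3.9`, the starting value, the size of the nudge and the step count are all arbitrary choices made for this sketch.

```python
# Toy illustration only: the logistic map stands in for a chaotic model.
# Nothing here is taken from the FAMOUS source code.


def sums_differ():
    """Same three numbers, two evaluation orders, different last digit."""
    left = (0.1 + 0.2) + 0.3
    right = 0.1 + (0.2 + 0.3)
    return left, right


def logistic_orbit(x0, steps, r=3.9):
    """Iterate x -> r * x * (1 - x) and return every value visited."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


def max_divergence(x0=0.3, nudge=1e-15, steps=200):
    """Largest gap between two orbits whose starting points differ by `nudge`."""
    a = logistic_orbit(x0, steps)
    b = logistic_orbit(x0 + nudge, steps)
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    left, right = sums_differ()
    print(left == right, left, right)
    print(max_divergence())
```

The same mechanism would explain why machines with an identical OS + processor combination tend to fail at the same moment: they round identically at every step, so they follow the same trajectory.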
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
My other two systems have now finished their batches of Famous models.
Results a bit better than the first one which had only 2 passes from 15!
Links below are to Task details.
Intel Q6600 @ 2.4 stock, 3GB RAM, Win XP Pro SP3 (32-bit).
2 passed - r100_599, r185_799
2 failed -
r149_599
r186_799
Intel i7 920 @ 3.0, 6GB RAM, Win 7 Home (64-bit) - i.e. slightly overclocked.
8 passed - r107_1799, r144_799, r146_1199, r147_799, r148_599, r152_1199, r153_1399, r197_599
7 failed -
r145_999
r149_599
r151_999
r155_1799
r156_599
r158_999
r218_599
Meantime, back to running only SM3 and AM3P models. :) |
|
|
|
|
|
The last model finished successfully
6835949
So 4 out of 5
Details for this machine:
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 4] Microsoft Windows Vista Ultimate x64 Edition, Service Pack 2, (06.00.6002.00)
No overclocking
Running 24/7 alongside Milkyway on the GPU when available.
Also used for daily work and surfing and games...
FYI I've just checked that my one failure was only a success on a Xeon running Darwin.
Hope this helps
____________

Forum search Site search |
|
|
|
|
The last model finished successfully
6835949
So 4 out of 5
Details for this machine:
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 4] Microsoft Windows Vista Ultimate x64 Edition, Service Pack 2, (06.00.6002.00)
No overclocking
Running 24/7 alongside Milkyway on the GPU when available.
Also used for daily work and surfing and games...
FYI I've just checked that my one failure was only a success on a Xeon running Darwin.
Hope this helps
Either you have an incredibly stable computer, or really great luck. Ever thought of betting on horse races? ;-)
____________
|
|
|
|
|
|
It's true I haven't had many issues with my models.
This one is an XPS MT. Cleaning it up after 8 months has cut the fan noise down.
Otherwise, I think the Vista Service Packs have helped with the occasional power outage.
I remember when you had to be extra careful shutting down, particularly on my laptop, probably due to delayed disk activity. But this desktop has only had one possibly related iceworld.
One thing I do is get Windows Update to ask me when to install patches, so I can shut down BOINC first (OTOH, I keep everything updated).
Another is to check the temperature. (I have to take my laptop apart every 6-8 months to clean it up.)
Finally, sorry to say that with the 8 models running, I've let go of regular backups.
____________

Forum search Site search |
|
|
|
|
[[B^S] mavau wrote:] One thing I do is get Windows Update to ask me when to install patches, so I can shut down BOINC first (OTOH, I keep everything updated).
Actually, that's a very good tip that we don't mention enough. Installing Windows updates (particularly if an automatic reboot is triggered) has certainly caused problems for models I've had running. Keeping the update warnings on and choosing when to download and install keeps things running smoothly.
|
|
|
|
|
|
Famous r131_1399_200_00632156_1 finished successfully. Windows 7 64-bit, Intel Core 2 Duo 2.2 GHz processor with 4 GB of RAM. That is my last FAMOUS WU from the first batch.
Does anyone know when the next batch will be released?
____________
|
|
|
mo.v (Forum moderator)
|
|
At the moment on the Beta project we're testing 6.04, which has quite a high crash rate during the early years. Some of these crashes are caused by deliberately wild perturbations. Hiro's talking about another version, presumably beta, in which a filtering mechanism will prevent some of the crashes caused by wild parameter value perturbations. He and Tolu tried this before but it didn't work on the earlier version.
So it doesn't look as if a release on the main CPDN site is imminent.
If anyone with plenty of experience with CPDN model types + a willingness to look at their progress regularly + ability to report experiences on the forum wants to join Beta, send me a private message and I'll explain how to attach.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
Hi, everyone:
I see that the FAMOUS models are back so I am reactivating this thread and asking people to report their successful completions and failures with this type of model. Please include processor type (Intel v. AMD), OS version, and amount of RAM. You might also include the s/TS and total time to complete the WU.
Hopefully this batch will be more stable than the last one was.
____________
|
|
|
|
|
|
Not much more stable, because of the science behind the modelling.
You WILL get failures, especially with the 'spinups'.
It's much like the early days of the project, 2003-2005, where the object is to find which parts of parameter space work and which don't.
|
|
|
|
|
|
I am afraid that Les is right. One WU already failed on my faster machine. It ran for less than 1 hour of CPU time. It was gone so fast that I didn’t even get a chance to write down its designation. All I remember is that it started in 1799. I guess there is no reason to make backups, as the WUs will just fail again at the same point if restored.
Computer has an Intel 2.2 GHz processor running Windows 7 64 bit with 4 GB of RAM.
In science they say that even negative results are results. If it helps to weed out bad starting parameters from the good ones it’s worth the computer time.
____________
|
|
|
|
|
|
The WU famous_u0of_1599_200_006633730_6 crashed on 07/06/2010 at approx. 37% completion. OS is Windows 7 64-bit running on an Intel Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
Two crashes so far:
famous_u0nl_1599_200_006633700_4
famous_u0mr_1599_200_006633670_1
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Intel64 Family 6 Model 26 Stepping 4] Microsoft Windows Vista Ultimate x64 Edition, Service Pack 2, (06.00.6002.00)
____________

Forum search Site search |
|
|
|
|
|
'NEGATIVE PRESSURE VALUE CREATED' on Q9300 in Vista_x64 after T.S. 140,426 (7.5%): famous_u0wv_1799_200_006634034_6
Edit: Three other Tasks in the Work Unit crashed with 'NEGATIVE PRESSURE'; three Tasks continue in progress.
____________
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Important stuff no longer here: http://www.climateprediction.net/board/viewforum.php?f=44 |
|
|
|
|
|
Model famous_u0ct_999_200_006633312_5 crashed on Q9550 Quad Intel Win7 x64 at about 3.1% complete. |
|
|
|
|
|
Completed on Q9300 in XP_x64: famous_u0tv_1799_200_006633926_5 Temperature curves reach for the stars at the end
Includes Tambora, Krakatoa, Katmai, Pinatubo volcanic events.
Edit: That leaves me at 50% for v.6.10.
____________
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Important stuff no longer here: http://www.climateprediction.net/board/viewforum.php?f=44 |
|
|
|
|
|
Model famous_u0o9_1599_200_006633724_6 crashed on C2D 6400 @ 2.13GHz WIN XP at about 38.25% complete. INVALID THETA DETECTED. |
|
|
|
|
|
Today - famous_upgs_1799_200_006665847_6 - Invalid Theta Detected.
18.5% completed on i7 920 @ 3.4, Win 7.
This machine has successfully completed 2 with 8 still running.
3 June - famous_u0ny_1799_200_006633713_1 - Invalid Theta Detected.
Failed before first trickle (after less than 1 hour) on a Q6600 @ 3.2, Win 7.
This machine has successfully completed 3 with one still running.
|
|
|
|
|
|
Latest results:
famous_u0pt_1999_200_006633780
I don't understand the difference in credit.
famous_u0pq_1999_200_006633777_3
That one completed with a workunit error?
____________

Forum search Site search |
|
|
mo.v (Forum moderator)
|
|
I think you're referring to the phrase 'Workunit error - check skipped'. This line is really for Boinc projects that compare two or more completed tasks to validate them and decide which will be the canonical result, which I think means the definitive result for the researchers.
CPDN doesn't validate results by this method. Almost every completed result is used. So this line is irrelevant for CPDN.
There are often also red lines on workunit pages that are irrelevant, such as 'Too many results' or 'Too many errors - may have a bug'(!!). I don't know whether it would be possible for CPDN to hide these lines. Milo's too busy still fixing problems from the Boinc upgrade to ask at the moment.
It would definitely be better if CPDN members never saw these Boinc phrases.
If you look at the News thread post about FAMOUS (top of this Number Crunching section) it may explain what you need to know about the credits.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
Latest results:
famous_u0pt_1999_200_006633780
I don't understand the difference in credit.
Granted credits are recalculated once a day. The 2 completed tasks should both have 6,176.41 credits tomorrow.
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
|
|
|
|
|
Thanks for the info.
Another completed model with:'Workunit error - check skipped'.
famous_u0my_1799_200_006633677
Looking at the Work unit page, it seems connected with "Too many total results" as you say.
To sum up, those two messages are just BOINC artefacts. They don't concern CPDN and are nothing to worry about:
Workunit error - check skipped
Too many total results
____________

Forum search Site search
|
|
|
|
|
|
Famous_u0mw1999_200_006633675_1 completed successfully. OS is Windows 7 64-bit running on an Intel Core 2 Duo 2.2 GHz with 4 GB of RAM. s/TS is 0.46.
____________
|
|
|
|
|
|
Summary for recent time away from home (since Friday):
Seven successful completions, seven crashes. (All in 64-bit Windows, Vista/W7/XP.) So, I remain at 50% for v.6.10.
____________
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Important stuff no longer here: http://www.climateprediction.net/board/viewforum.php?f=44 |
|
|
|
|
|
So far 5 completed, two errors.
I'm changing preferences to run only Famous to see what happens.
____________

Forum search Site search |
|
|
|
|
|
Another crash, so 7 completed 3 crashes.
____________

Forum search Site search |
|
|
|
|
|
2 completed, no failures.
Win Xp Pro 4Gigs ram.
|
|
|
|
|
|
I thought this one would complete :-)
Inspiron laptop with Windows 7 Pro 64, 2Gigs RAM.
____________

Forum search Site search |
|
|
|
|
|
On 14 June, iansm wrote: Today - famous_upgs_1799_200_006665847_6 - Invalid Theta Detected.
18.5% completed on i7 920 @ 3.4, Win 7.
This machine has successfully completed 2 with 8 still running.
3 June - famous_u0ny_1799_200_006633713_1 - Invalid Theta Detected.
Failed before first trickle (after less than 1 hour) on a Q6600 @ 3.2, Win 7.
This machine has successfully completed 3 with one still running.
Another crashed in the 920 system today at 39.5%.
famous_upfc_1599_200_006665795_3 - Invalid Theta Detected.
3 completed, 2 crashed and 6 still running.
The Q6600 system finished its 4th and final model (meantime) with just the one failure. |
|
|
|
|
|
I had THIS ONE crash today.
Completed TS 1,141,946
Average time per TS 0.4505
System
AMD Athlon II 235e
6 Gigs ram.
Today I can't get a new one: they get download errors. I checked, and others get them on the same units, so I will run HADAM3P till I can get one tomorrow. I have already had my quota of one per day.
____________
Keep on crunching Pizza@Home |
|
|
|
|
|
Core i3 530 2.93GHz, 2GB Kingston valueRAM, Gigabyte H55M UD2H mo'board, Linux Arch 2.6.33, 100% CPDN.
Crashed 3:
u0d9_0599 neg. press. 42,999 sec
upij_0799 theta 271,931 sec
u0s5_1999 neg. press. 155,591 sec
Completed 2:
u0s4_1799 1,029,859 sec
u0sp_1799 1,029,843 sec
In progress 1: u089_0599 - 90%
Mystery (says in progress on web page, but isn't on PC) 1: u0ch_1999 |
|
|
|
|
|
One mistake in my previous post: only 6 completed models.
And 7 crashes.
The latest:
famous_uow2_1799_200_006665101
famous_uoxh_1799_200_006665152
famous_uowz_1799_200_006665134
____________

Forum search Site search |
|
|
|
|
|
Famous_u0na_1799_200_006633689_6 crashed at 96% completion. OS is Windows 7 32 bit running on an Intel Core 2 Duo 1.5 GHz processor with 2 GB of RAM. 1.06s/TS RIP :(
____________
|
|
|
|
|
|
Famous_u0mw_1799_200_006634055_6 completed successfully. OS is Windows 7 32-bit running on an Intel Core 2 Duo 1.5 GHz processor with 2 GB of RAM. 1.05 s/TS. :) I seem to be running at about a 50% success rate on this type.
____________
|
|
|
mo.v (Forum moderator)
|
|
I've looked at how some of the top computers are doing, adding together results for FAMOUS 6.10 and 6.11. I've not counted models with downloading errors as that was a server problem.
Peter, Linux: 6 completed, 5 errored
Ian Rees, Windows: 5 completed, 5 errored
Montes, Mac: 2 completed, 7 errored
Mike Koehler, Mac: 2 completed, 6 errored
Anonymous, Windows: 1 completed, 6 errored
This is less than the approx 50% success rate you estimate, but two factors make the above figures not entirely reliable.
* Models that crash take less computing time than completions.
* The list doesn't include partly processed models, and the further a model has progressed the less likely it must be to crash, i.e. the more likely to succeed.
So I think the success ratio of these computers will probably increase as they have time to finish more models.
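As a rough check on the figures above, here is a small sketch that pools the five computers' counts and then shows, with a made-up batch, why an early snapshot can understate the success rate. The counts in `TOP_COMPUTERS` are copied from the list; everything in `HYPOTHETICAL_FINISH_DAYS` (eight models, the 10-day run length, the crash days) is invented purely for illustration and is not measured data.

```python
# Counts copied from the list above: (completed, errored) per computer.
TOP_COMPUTERS = {
    "Peter, Linux": (6, 5),
    "Ian Rees, Windows": (5, 5),
    "Montes, Mac": (2, 7),
    "Mike Koehler, Mac": (2, 6),
    "Anonymous, Windows": (1, 6),
}


def pooled_success_rate(counts):
    """Completed models as a fraction of all finished (completed + errored)."""
    completed = sum(c for c, _ in counts.values())
    errored = sum(e for _, e in counts.values())
    return completed / (completed + errored)


# Hypothetical batch, invented for illustration: eight models started together.
# Each entry is (day on which the model finishes, True if it completed).
HYPOTHETICAL_FINISH_DAYS = [
    (10, True), (10, True), (10, True), (10, True),
    (1, False), (3, False), (5, False), (7, False),
]


def observed_success_rate(snapshot_day, batch=HYPOTHETICAL_FINISH_DAYS):
    """Success rate among the models that have already finished on `snapshot_day`."""
    finished = [ok for day, ok in batch if day <= snapshot_day]
    if not finished:
        return None  # nothing has finished yet
    return sum(finished) / len(finished)


if __name__ == "__main__":
    print(f"pooled success rate: {pooled_success_rate(TOP_COMPUTERS):.1%}")
    for day in (2, 6, 10):
        print(day, observed_success_rate(day))
```

With these invented durations a snapshot on day 6 sees only crashes, and the observed rate only reaches the batch's true 50% once the long runs finish on day 10.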
A more accurate estimate could be obtained by trawling through many workunits to see how many succeed on all platforms and how many crash on one, two or three. But this would be extraordinarily time-consuming. Because some computers crash models for non-model-related reasons one would need to look at the stderr of every model failure apart from those that couldn't get started because of a computer misconfiguration.
I will not be doing this.
The % of workunits that complete on all platforms must be lower than the average success % on members' computers.
One of us could look at those very stable top computers again after say another month.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
One more crash in my i7 920 system (@3.4 with Win 7 Home x64) at 51.5%.
famous_upfd_1799_200_006665796_0 - Invalid Theta Detected.
3 completed, 3 crashed and 5 still running in this machine. |
|
|
|
|
Invalid Theta Detected.
Just in case anyone is wondering what 'theta' is: potential temperature.
|
|
|
|
|
|
Two more successes:
famous_up1h_1399_200_006665296
famous_uoxz_1799_200_006665170
8 completed models, 7 crashes, 8 running on the corei7 and 1 on the Inspiron.
____________

Forum search Site search |
|
|
|
|
|
Invalid Theta on this task: famous_r100_799_200_006666899_1.
So far, five completions, and one other Invalid Theta. All on Win7_x64. |
|
|
|
|
|
I'm now on 13 completed models and 10 crashes.
____________

Forum search Site search |
|
|
|
|
|
Famous_u0qu_1799_200_006667114_2 completed successfully. OS is Windows 7 64-bit running on an Intel Core 2 Duo 2.2 GHz with 4 GB of RAM.
____________
|
|
|
|
|
|
I've had a look in a little more depth at the FAMOUS success/failure stats from the first two pages of the 'Top Computers' list.
I tried to pick computers with at least 700,000 credits, so not "drive-bys". Compute errors only, as before.
Computer   OS               Pend+Invalid   Error   Error%
976458     Darwin           11             29      73
1013254    Darwin           4              29      88
1001600    Darwin           0              9       ALL
978938     Darwin           4              12      75
1063866    Darwin           3              27      90
Overall fail%, Darwin: 83 (excluding 1001600: 82)
1000554    W7               2              3       60
961681     WSv2008          7              12      63
882224     WXP X64          5              2       29
Overall fail%, Windows: 55
1036870    Lin 2.6.16       16             8       33
1072992    Lin 2.6.32       6              7       54
1047400    Lin 2.6.32 FC12  7              6       46
Overall fail%, Linux: 42
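For anyone who wants to redo the arithmetic with fresher numbers, the percentages in the table above can be reproduced with a few lines of Python. This is only a sketch: it assumes that the "Pend+Invalid" column counts the tasks that finished without a compute error, a reading that happens to reproduce every figure in the table, and the function names are made up for this example.

```python
# Rows copied from the table above:
# (computer id, OS family, Pend+Invalid, Error).
# Assumption: Pend+Invalid counts tasks that finished without a compute error.
ROWS = [
    (976458, "Darwin", 11, 29),
    (1013254, "Darwin", 4, 29),
    (1001600, "Darwin", 0, 9),
    (978938, "Darwin", 4, 12),
    (1063866, "Darwin", 3, 27),
    (1000554, "Windows", 2, 3),
    (961681, "Windows", 7, 12),
    (882224, "Windows", 5, 2),
    (1036870, "Linux", 16, 8),
    (1072992, "Linux", 6, 7),
    (1047400, "Linux", 7, 6),
]


def fail_percent(finished_ok, errors):
    """Errors as a whole-number percentage of all finished tasks, halves rounded up."""
    total = finished_ok + errors
    # Integer arithmetic avoids Python's round-half-to-even behaviour.
    return (200 * errors + total) // (2 * total)


def family_fail_percent(rows, family, exclude=()):
    """Pooled failure percentage for one OS family, optionally skipping hosts."""
    picked = [r for r in rows if r[1] == family and r[0] not in exclude]
    ok = sum(r[2] for r in picked)
    errors = sum(r[3] for r in picked)
    return fail_percent(ok, errors)


if __name__ == "__main__":
    for host, family, ok, errors in ROWS:
        print(host, family, fail_percent(ok, errors))
    for family in ("Darwin", "Windows", "Linux"):
        print(family, family_fail_percent(ROWS, family))
```

The family figures are pooled over all tasks in the family rather than averaged over computers, which is what matches the 83%, 55% and 42% quoted.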
Of course this is a snapshot, so you won't get these numbers now, or not all of them anyway. And early days, and all that. However.
Is it possible there is a problem with the MacOS code? Especially since most of the Darwin computers have relatively few failures with the other types of models.
Edit: will cross-post on CPDN board as this board seems to ignore the "pre" tag, so the table is not easy to follow. |
|
|
|
|
|
On my systems here at cpdn...
Core i7 920 in Linux
6 completed, 7 failed, 4 in progress
Phenom II X4 940 in Linux
7 completed, 5 failed, 4 in progress
Core 2 E6420 in Windows
2 completed, 0 failed, 1 in progress |
|
|
|
|
|
There's always the possibility of faulty data files, but ALL types of climate model are tested for months on our beta site.
It's possible that your comparisons are too simplistic.
As I said near the start of this thread, it's known that some of the series of models with "early label names" were being "pushed hard" with their forcing values, making them more unstable. (Some of the models that I have now, are up to the "u" series.)
And I also said there that the models with a start year of 599 are 'spinups', which are also more unstable than any of the subsequent year starts. As these later years use data from models of the previous year that completed, (which will allow these 2 years to be "stitched together" to form a longer year), it's more likely that the parameter values used are from a stable part of parameter space.
And they will definitely be using a spinup that was stable. :)
So your comparison would need to take into account these 2 items: the series name, and the start year of the models.
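To group results by those two items, the task names quoted in this thread can be split mechanically. The sketch below is a guess at the name layout based only on the examples posted here: the meaning of the third and fourth fields is not stated in the thread, so the pattern matches them without interpreting them, and treating the first letter of the label as the "series" is likewise an assumption.

```python
import re

# Assumed layout, inferred from names quoted in this thread:
#   famous_<label>_<start year>_<field>_<number>[_<suffix>]
# Only the label and start year are interpreted; the rest is matched and ignored.
TASK_NAME = re.compile(
    r"^famous_(?P<label>[a-z0-9]+)_(?P<start_year>\d+)_\d+_\d+(?:_\d+)?$",
    re.IGNORECASE,
)

SPINUP_START_YEAR = 599  # per the thread, start-year 599 models are spinups


def parse_task_name(name):
    """Return label, series letter, start year and spinup flag, or None if no match."""
    match = TASK_NAME.match(name.strip())
    if match is None:
        return None
    label = match.group("label").lower()
    start_year = int(match.group("start_year"))
    return {
        "label": label,
        "series": label[0],  # assumption: first letter names the series
        "start_year": start_year,
        "is_spinup": start_year == SPINUP_START_YEAR,
    }


if __name__ == "__main__":
    print(parse_task_name("famous_u0of_1599_200_006633730_6"))
    print(parse_task_name("Famous_r125_1399_200_006632634_4"))
```

Names that do not fit the assumed pattern, such as the short form r219_599, return `None` rather than a guess.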
____________
Backups: Here |
|
|
|
|
|
On my own machine, Core i3 Linux, I have had 3 complete and 5 failed, a failure rate of 63%.
I have my suspicions about my computer's memory (Kingston valueRAM), even though it passes the memtest86+ test. I have underclocked the memory by 10% and the latest 4 models are running fine so far. Time will tell.
In case you can't decipher the messed-up table below, the essence was
Darwin failure rate 82%, Windows failure rate 55%, Linux failure rate 42%. Darwin seems to be an outlier. |
|
|
|
|
|
If I recall correctly from beta, the FAMOUS application for Darwin is using a higher optimization because they couldn't compile it without it. That may, or may not have anything to do with the failure rate.
As Les said, however, some of these sets will be inherently more unstable than others due to parameter choices. It's difficult to accept only a 50% success rate when it's previously been > 95%, but that's the nature of running this FAMOUS experiment. |
|
|
|
|
|
Famous_r149_799_200_006666483_5 completed successfully. OS is Windows 7 64 bit running on Intel Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
I don’t know if it is just luck, but, this is 2 for 2 with the Famous models with the new graphics.
____________
|
|
|
|
|
|
More detailed investigation as suggested by Les.
Ignoring anything that is not "famous_uxxx_", and all with _599_ start year, i.e. looking at just "u series and not 599":-
Darwin Xeon (3 computers): 20 succeeded, 70 failed.
Darwin i7 (1 computer): 9 succeeded, 7 failed.
Win Opteron (1 computer): 6 succeeded, 6 failed.
Linux Xeon (2 computers): 15 succeeded, 9 failed.
Linux i7 (1 computer): 5 succeeded, 7 failed.
All of these are compatible with the "about fifty-fifty chance of failure" warning, except for Darwin Xeon. It could be just chance... but it might not.
(And actually, the r series and the "599s" don't make much difference to the percentages, in the tiny sample of computers I looked at.)
I'm not comparing the failure rate to anything--I've been away from the project for a few years, and only had about 10 SM3s before starting on famouses. I don't have Darwin, or a Xeon--more's the pity ;-). I'm just saying that there might be something to look into, using proper statistical methods.
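As a first step towards "proper statistical methods", one could ask how likely each group's success count would be if every task really had a fifty-fifty chance. The sketch below computes the lower binomial tail, P(successes <= observed), from the counts listed above. It assumes the tasks are independent and share the same 50% chance, which the discussion of series and start years suggests is not strictly true, so treat the output as a rough screen rather than a test result.

```python
from math import comb

# (succeeded, failed) per group, copied from the list above.
GROUPS = {
    "Darwin Xeon": (20, 70),
    "Darwin i7": (9, 7),
    "Win Opteron": (6, 6),
    "Linux Xeon": (15, 9),
    "Linux i7": (5, 7),
}


def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def lower_tail_probabilities(groups, p=0.5):
    """Chance of seeing this few successes or fewer if each task succeeds with probability p."""
    return {
        name: binom_cdf(succeeded, succeeded + failed, p)
        for name, (succeeded, failed) in groups.items()
    }


if __name__ == "__main__":
    for name, prob in lower_tail_probabilities(GROUPS).items():
        print(f"{name:12s} {prob:.3g}")
```

Under those assumptions only the Darwin Xeon group comes out with a very small tail probability; the other four are unremarkable for a fifty-fifty process.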
Geophi - compiler (option) problems was my first guess. Famous models seem to be smaller than others, only about 30 MB resident rather than 100+ MB -- CPUs seem to spend less time moving data in and out from memory, and more time computing. Maybe the famous code has flushed out a very obscure intermittent bug.
And maybe it's just chance.
This is about as much investigation as I'm prepared to do without writing scripts, and it'd be better for someone who has direct access to the database to do that. So: leaving it there, thanks for listening. ;-) |
|
|
|
|
|
On the Darwin thing: I have 5 succeeded and 3 failed on beta. On main-project Windows, 1 succeeded and 3 failed. (Plus, the current beta WUs are apparently exploring a different parameter range - just to add to the confusion over success/failure ratios.) |
|
|
|
|
|
Famous_u0il_1799_200_006667077_3 finished successfully.
OS is Windows 7 64 bit running on Intel Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
THREE IN A ROW AND COUNTING.
____________
|
|
|
|
|
|
Updated as of July 21, on my systems here at cpdn...
Core i7 920 in Linux
8 completed, 10 failed, 4 in progress
Phenom II X4 940 in Linux
11 completed, 7 failed, 4 in progress
Core 2 E6420 in Windows
3 completed, 1 failed, 1 in progress |
|
|
|
|
|
Q6600 2.4gig running stock.
Windows XP 64 bit.
1 Famous run to completion;
and then on second Famous;
7/22/2010 7:02:13 AM climateprediction.net Started upload of famous_u01x_1799_200_006632920_5_6.zip
7/22/2010 7:02:36 AM climateprediction.net Finished upload of famous_u01x_1799_200_006632920_5_6.zip
7/22/2010 8:12:12 AM climateprediction.net Sending scheduler request: To send trickle-up message.
7/22/2010 8:12:12 AM climateprediction.net Not reporting or requesting tasks
7/22/2010 8:12:14 AM climateprediction.net Scheduler request completed
7/22/2010 8:38:35 AM climateprediction.net Resuming task famous_u01x_1799_200_006632920_5 using famous version 611
7/22/2010 9:10:01 AM climateprediction.net Computation for task famous_u01x_1799_200_006632920_5 finished
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_7.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_8.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_9.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_10.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_11.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_12.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_13.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_14.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_15.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_16.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_17.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_18.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_19.zip for task famous_u01x_1799_200_006632920_5 absent
7/22/2010 9:10:01 AM climateprediction.net Output file famous_u01x_1799_200_006632920_5_20.zip for task famous_u01x_1799_200_006632920_5 absent
____________
|
|
|
mo.v (Forum moderator)
|
|
Mike, all those messages about the missing files just mean that the model crashed before it could generate those files.
Here's the crashed model's web page. If you click on stderr out + you'll see that it crashed because of NEGATIVE THETA, i.e. because of the model's initial parameters. Nothing to worry about. The researchers want us to run them whether they crash or complete.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
Famous_up5n_1599_200_00665446_1 and Famous_umvv_1999_200_006662502_5 both completed successfully.
Famous_up5n_1599_200_00665446_1 was run on Windows 7 32-bit on a Core 2 Duo 1.5 GHz processor with 2 GB of RAM.
Famous_umvv_1999_200_006662502_5 was run on Windows 7 64-bit on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
This makes 5 successful completions in a row. Have they done something to improve stability, or did the scientists just front-load the very early batches with WUs that have extreme parameters (and are more likely to fail)?
____________
|
|
|
|
|
|
As I said somewhere, the new model type takes us back to 2003-4, when the original 'slab' model was used.
The only way to find out which values lead to a long run, was to try them, and 'mark off' those values that caused early failures, and keep those that lasted the distance.
And this is what is happening with these totally different Millennium models: try everything and see what happens.
As it says here: Slogan: Historical climate records tell various stories. Let's test them all.
And it also says:
In addition to perturbations for internal physics parameters of the model and initial condition, this experiment requires a large number of forcing perturbations to deal with the large uncertainty in the historical forcings.
The very first versions were more unstable, so more testing was done to find failure points, and compiler options were also changed.
And the type 'name' series are using different degrees of values, and this affects the stability.
Remember, this is a short term project, and lots of climatologists are poring over the results as they come in. On beta, Hiro is watching as each new trickle arrives. Well, several times a day. :)
The current 'test' version, which has a name starting with s2..., is producing 'hot' results, and Hiro knows about these before they fail/complete.
No doubt something similar is happening on this main site as well.
____________
Backups: Here |
|
|
|
|
|
Famous_up23_1999_200_006665318 finished successfully.
OS is Windows 7 64 bit running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
Famous_uky9_1599_200_006659996_3 failed at approx. 80% completion. OS is Windows 7 64 bit running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11460092
process exited with code 22 (0x16, -234)
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: ATM_DYN : INVALID THETA DETECTED.
error
____________
|
|
|
|
|
|
I have also had many FAMOUS models crashing in the last few days, on an Intel Pentium Dual CPU "E2200" at 2.2 GHz (native).
I have never had any crashing models before and nothing has changed in the computer. It is perfectly stable.
Is there some workaround for those crashes?
computer:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/show_host_detail.php?hostid=1051527
Peace and Love!
Filip
____________
|
|
|
mo.v (Forum moderator)
|
|
Hello Overtonesinger
I've looked at the results for computer 1051527 which has an excellent list of model completions.
If you look at the web pages for the crashed FAMOUS models here and here and for each model click on + beside stderr you will see extra messages. Both models have exit code 22 and messages including NEGATIVE PRESSURE or INVALID THETA.
FAMOUS models are experimenting with some very extreme parameter values. In some cases this causes model crashes. It is not the fault of the computer; it's part of the experiment and even the crashed models are useful for Hiro, the researcher. If the crash is caused by the model parameter values you usually see NEGATIVE PRESSURE or INVALID THETA messages.
If you look at the workunit page for each crashed model (each model/task belongs to a workunit containing several copies of the same task) you see for example this. The processing of models depends on a combination of the computer's CPU (Intel or AMD) and its operating system (Windows, Linux or Mac/Darwin). Computers with the same combination usually all complete or all crash at the same processing moment. You will see that the two computers with Darwin crashed at the same moment, but not at the same moment as your computer which has Windows. The computer with Linux may complete the model.
But if we look at the other workunit we find two computers that couldn't start the model. Their models have -226 and -185 exit codes. These mean there's a problem in those computers. Their firewall or antivirus is probably blocking Boinc.
Don't try to back up or restore FAMOUS models that crash on a stable computer. They would crash again at the same processing moment.
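The triage described above can be sketched in Python. This is only an illustration, assuming the exit code and the stderr text have been copied by hand from a task's result page. The function name `classify_failure` and the category labels are hypothetical, and only the exit codes and messages mentioned in this post are covered.

```python
# Hypothetical helper: classify a failed FAMOUS task from its exit code and
# the text shown under "stderr" on the result page. Only the cases named in
# the post above are handled; everything else is reported as "unknown".

MODEL_MESSAGES = ("NEGATIVE PRESSURE", "INVALID THETA")
HOST_EXIT_CODES = {-226, -185}  # the post attributes these to the host computer


def classify_failure(exit_code: int, stderr_text: str) -> str:
    """Return a rough category for a failed task."""
    text = stderr_text.upper()
    if exit_code in HOST_EXIT_CODES:
        return "host problem"
    if exit_code == 22 and any(msg in text for msg in MODEL_MESSAGES):
        return "model instability"
    return "unknown"


if __name__ == "__main__":
    sample = "Model crashed: ATM_DYN : INVALID THETA DETECTED."
    print(classify_failure(22, sample))
    print(classify_failure(-226, ""))
```

A result in the "model instability" category is part of the experiment, so nothing needs fixing on the computer. A "host problem" result is the case where the firewall or antivirus is worth checking.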
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
Famous_u1eh-1199_200_006634660_4 completed successfully.
Famous_u42q_1799_200_006638125_5 failed and Famous_u57z_1799_200_006639610_3 failed at approx. 45% completion. OS is Windows 7 64 bit running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11460092
process exited with code 22 (0x16, -234)
Suspended CPDN Monitor - Suspend request from BOINC...
Model crashed: ATM_DYN : INVALID THETA DETECTED.
error
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
update ^^
____________
|
|
|
|
|
|
At risk of jinxing myself (Superstitious? Who? Me?), I've had more FAMOUS successes than failures lately, both here and on Beta. (Fingers crossed ...) May that be, or soon become, true for everyone.
____________
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Important stuff no longer here: http://www.climateprediction.net/board/viewforum.php?f=44 |
|
|
|
|
|
I hate to say this but I felt the same way a little while back. Had 5 successes in a row. Since then I have had 3 crashes, with only 1 successful completion. I guess the law of averages is catching up with me.
____________
|
|
|
|
|
|
First task was a success
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11432527
but the second crashed:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11474733
I had the message that a certain .zip-file wasn't there (I can't remember the full file-name :( )
greetz from Switzerland
littleBouncer
____________
  |
|
|
|
|
|
Missing zip file messages are normal when a model crashes - if the model hasn't progressed to the point where the file is created, then BOINC can't find it to upload it.
The real messages about the failure are on the web page for the model; click the + sign alongside stderr to see them.
|
|
|
|
|
|
So far I have 21 successes and 10 failures (all "negative theta"), with 6 in progress and 4 waiting to run (a reserve supply because of download problems).
As failures run for a shorter time, they will skew the results in the short term: failures will appear more frequent than they actually are, and a true ratio will only become apparent over the longer term.
Two of my machines run Linux and one runs Windows 7; failure rates seem about the same.
When checking the other computers in the work units of failed models, I was surprised by the differences between Windows systems (XP, Vista and 7) in whether the model failed and how far it got. One would expect them all to fail at the same point, which they generally did when running the same OS.
Two of the work units had a lot of Linux machines: in one they all failed at the same point, in the other they were all different (you can't win).
Perhaps more research needs to be done on this to see whether it is true, and not just a coincidence in the ones I looked at. |
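The short-term skew mentioned in this post can be illustrated with a small Python sketch. The batch below is made up for illustration, and it assumes all tasks start on day 0 on separate cores, which is a simplification. The function name `observed_failure_fraction` is hypothetical.

```python
# Each task is (run_time_in_days, completed). All tasks are assumed to start
# at day 0 on separate cores, which is a simplification.

def observed_failure_fraction(tasks, day):
    """Failure fraction among tasks that have finished (either way) by `day`."""
    finished = [completed for run_time, completed in tasks if run_time <= day]
    if not finished:
        return None  # nothing has finished yet
    failures = sum(1 for completed in finished if not completed)
    return failures / len(finished)


# Made-up batch: 6 models that complete after 12 days, 4 that crash earlier.
tasks = [(12, True)] * 6 + [(2, False), (4, False), (7, False), (11, False)]

for day in (3, 8, 12):
    print(day, observed_failure_fraction(tasks, day))
```

In this made-up batch every finished task is a failure until the first completions arrive on day 12, at which point the observed fraction drops to the batch's true 4 in 10.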
|
|
|
|
|
Famous_u6f5_1399_200_006641826_6 failed at approx. 95% on a machine running Windows7 64 bit with 2.2 GHz Core 2 Duo processor and 4 GB of RAM.
Famous_1399_200_006641826_6 completed successfully on a machine running Windows7 64 bit with 2.2 GHz Core 2 Duo processor and 4 GB of RAM.
Famous_u34a_1799_200_006636885_2 completed successfully on a machine running Windows7 32 bit with Core 2 Duo 1.5 GHz processor and 2 GB of RAM.
Famous_u34a_1799_200_006636891_1 completed successfully on a machine running Windows7 32 bit with Core 2 Duo 1.5 GHz processor and 2 GB of RAM.
____________
|
|
|
|
|
|
Don't know if this is of any interest but it might be, because team Scotland members have quite a good record of completing long models, from BBC onwards. From this page of Iansm's brilliant stats for the team, it can be seen that, of 484 FAMOUS models issued to team members to date, 210 have completed and 170 have failed.
____________
Visit the Scotland team
 |
|
|
|
|
|
Famous_ua0y_1799_200_006636885_2 failed at approx. 12% running on 2.2 GHz Core 2 Duo processor running Windows 7 64 bit. At least this one had the good grace to fail early (36 hours) and not after 11 days (95%) of running.
____________
|
|
|
|
|
|
Just had one of mine fail at about 34%, with a different error this time - i.e. not "invalid theta": famous_ubod_599_200_006647976_2.
The error was
SETPOS: Seek Failed: Invalid argument
SETPOS: Unit 61 to Word Address -198 Failed with Error Code -1
Model crashed: SETPOS: Unit 61 to Word Address -198 Failed with Error Code -1
repeated 6 times. Same exit code 22, though.
This breaks a run of 7 successes. Totals so far: 17 completed, 9 failed (plus 3 "download errors" from the server glitch back in June). |
|
|
|
|
|
Just a note on those 3 "download errors": Two of them didn't get processed at all:
famous_uopf_1599_200_006664862 and famous_uopj_1799_200_006664866.
I wonder how many more work units are like that, and whether it will be a problem for the experiment? |
|
|
|
|
|
Greg
Your recent failure was Invalid theta. The other messages are most likely what happened when the program was suddenly diverted to a different (incorrect) area of code by the failure. The researchers will pick it up when looking through the lists, so not a problem for you.
The models that didn't arrive due to download errors are called phantom models.
And they are a problem to the project, because there's less chance of that particular combination getting processed by someone else. (No chance, if all of the batch failed to download.)
If the area of parameter space involved with the download problems at that time is important enough to whichever physicists are running those models, then they'll request that they be included again at some point.
|
|
|
|
|
|
Les - well, maybe. The model ran for about 20 hours after the third and last "Invalid Theta" message appeared in stderr.txt. (Note to programmers: it'd be handy if error messages were timestamped.)
All of my Famous models have logged at least one "invalid theta" message, but the majority go on to completion. I guess the code's "back up and re-try" works ;-).
As well as the "download error" models, I have two "normal" phantoms: famous_u0ch_1999_200_006633300_5 and famous_ulrv_799_200_006661062_0
These phantoms are "In Progress" according to the web site, but never made it to my machine. I recall watching (in the Boinc Manager) one of the download files, for u0ch, get to about 90% downloaded - and then just vanish. Not to worry: someone else managed a complete run for that work unit. |
|
|
|
|
|
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6856284
hello ^^
error 6
O_o
____________
|
|
|
|
|
|
Famous_ueet_999_200_006651520_4 failed. Reason: Model crashed: ATM_DYN : INVALID THETA DETECTED. Computer is Windows 7 64 bit with Intel Core 2 DUO 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
Model crashed: ATM_DYN : INVALID THETA DETECTED. Three results of that WU have already done the same.
I still have 7 active Famous 6.11 and a bunch of finished ones on that box. Besides the one mentioned here no errors so far. |
|
|
|
|
|
Famous_u9rf_1599_200_006645494_3 finished successfully. OS is Windows 7 64 bit running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11515402
22 error ^^
The device does not recognize the command. (0x16) - exit code 22 (0x16)
28-Aug-2010 22:01:05 [climateprediction.net] Started upload of famous_ufhh_1599_200_006652912_4_8.zip
28-Aug-2010 22:01:06 [climateprediction.net] Sending scheduler request: To send trickle-up message.
28-Aug-2010 22:01:06 [climateprediction.net] Not reporting or requesting tasks
28-Aug-2010 22:01:12 [climateprediction.net] Scheduler request completed
28-Aug-2010 22:04:20 [climateprediction.net] Finished upload of famous_ufhh_1599_200_006652912_4_8.zip
28-Aug-2010 23:10:23 [climateprediction.net] Computation for task famous_ufhh_1599_200_006652912_4 finished
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_9.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_10.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_11.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_12.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_13.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_14.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_15.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_16.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_17.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_18.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_19.zip for task famous_ufhh_1599_200_006652912_4 absent
28-Aug-2010 23:10:23 [climateprediction.net] Output file famous_ufhh_1599_200_006652912_4_20.zip for task famous_ufhh_1599_200_006652912_4 absent
____________
|
|
|
|
|
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11515402
22 error ^^
The device does not recognize the command. (0x16) - exit code 22 (0x16)
...
This is a "Theta" issue too; the file transfer errors are just a result of the Theta failure. |
|
|
|
|
|
Success/failure ratio rises as 'no go' parameter space is identified and avoided, but if combinations of physically-plausible parameter values fail then does this suggest that the general model is not robust? |
|
|
|
|
|
Hypothetical question. In general, the researchers don't know which combinations of perturbed parameters are plausible until they're tried and have identical failures, or similar completions, within a Task. (We're still testing this in Beta.) The range of possible parameter combinations and perturbations is vast.
The Models we run are not untested. They were developed by the U.K. MetOffice and are used in regular weather and climate applications; our task in Beta is to test the envelope that allows a SuperComputer Model to run on a PC, as well as parameter ranges. (CPDN's goal is not "the" solution for the "climate problem." Rather, it is to understand a reasonable range. There is quite a bit of Project and science background information on the other Boards, starting with the home page. http://climateprediction.net/)
Edit: Added hot link.
____________
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Important stuff no longer here: http://www.climateprediction.net/board/viewforum.php?f=44 |
|
|
|
|
|
Thanks for your response. I posted here because it related to the thread subject, albeit with wider implications.
Maybe it's pointless to pursue this if the question is considered naive, ill-informed or trivial. Do you mean that because the model is validated already then combinations of physically-plausible parameter values (i.e. values that are realistic on the basis of physical measurements) will not fail, or are some model parameters not directly related to physical measurements? |
|
|
|
|
|
The model being validated just means that the program software is OK as far as is known. But that's with the combinations of hardware and software that the testers used.
All 'climate' parameters/values can fail if used in certain combinations. Or if the models were to be run for longer periods.
If the models DON'T fail from instability, then they can still do so because of the hardware/software used on the computer running the model.
e.g. Some people overclock their computers and say that they're still stable. But the Floating Point Unit, (FPU), that is used for lots of calculations may have trouble providing data at the faster rate, and give values that cause the model to be slightly different to what it would be if the computer wasn't overclocked. And, over time, these slight differences add up.
____________
Backups: Here |
|
|
|
|
|
Perhaps I'm not making myself clear. By 'validated' I mean that given 'sensible' inputs the model generates sensible outputs, for example making accurate predictions from historical data sets. Repeat runs of a parameter combination will presumably identify variation due to software-hardware interaction. But is there a straightforward answer to my question? |
|
|
|
|
|
Back when we ran the original 200-year ocean Spinups for the 180-year HadCM3 Tasks, there was a baseline, unperturbed, Task thrown into the mix. On the other hand, none of the Spinups had particularly aggressive parameters because the goal was a set of ocean files to put into HadCM3 Tasks, so every participant wouldn't have to run that nearly four months of work to get to the three-plus-month Task at hand. If I recall correctly, the Spinups didn't crash - unless the computer did it (as one of mine did, within hours of completion after nearly four months on a Pentium-4, thanks to a power glitch that found its way to the machine despite a UPS unit [fortunately, I made daily backups]).
Except for the aside about my machine, is that within range of what you are getting at? (I confess to not understanding what you really want to know.)
____________
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Important stuff no longer here: http://www.climateprediction.net/board/viewforum.php?f=44 |
|
|
|
|
Success/failure ratio rises as 'no go' parameter space is identified and avoided, but if combinations of physically-plausible parameter values fail then does this suggest that the general model is not robust?
It is sometimes challenging to state what a physically plausible parameter value is. Processes (like thunderstorms or individual clouds) that are too small in scale to resolve on the model's large grid have to be parameterized. The following text, from the basic experiment strategy for older models, describes the parameters. Individual links within this text take you to further explanations of parameters:
Parameters
Every climate model has to make a number of approximations, called parameterisations. To read more about these, click here. Basically this means that there are numbers in the model which are given a certain, fixed value, but this value is not known for sure and a range of values could be equally realistic. The experiments will investigate the effect on the modelled climate of varying the value of 20 of the most poorly understood parameters in the model - such as the relationship between the number of raindrops in a cloud and how much it actually rains (to see what they are, click here). It is possible that some combinations of parameters may replicate the past climate equally well, but produce widely different forecasts for what might happen in the future. Some combinations of parameters will not work at all, produce a completely unrealistic climate (for example an Earth that boils or freezes, or oscillates between very hot and very cold every couple of years) and probably crash the model. It is not possible for us to tell beforehand what these combinations will be.
And this is a very good description of the millennium experiment which talks about why some models in this experiment are expected to fail. |
|
|
|
|
|
Thank you all. I'm a bit closer to understanding now. |
|
|
|
|
|
Famous_u9d4_599_200_006644979_1 completed successfully.
OS is Win7 32 bit running on a Core 2 Duo 1.5 GHz processor with 2 GB of RAM.
____________
|
|
|
|
|
|
Famous_u9no_1399_200_006645359_3 finished successfully. OS is Windows 7 64 bit running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
Sorry to report that my Famous_ubdx_599_200_006647600_0 has crashed with an "unrecoverable error" :-(
____________
Visit the Scotland team
 |
|
|
|
|
|
Or more explicitly, with: INVALID THETA
____________
Backups: Here |
|
|
|
|
|
Famous_ufb3_999_200_006652682_2 completed successfully. OS is Windows 7 64 bit running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
|
|
|
|
|
Or more explicitly, with: INVALID THETA
Thanks Les, that info wasn't yet showing when I first posted. When the "Invalid Theta" message did appear, I meant to come back and amend my post but got kinda sidetracked, as happens around here! Thanks for clarifying. ;-)
____________
Visit the Scotland team
 |
|
|
|
|
|
27 tasks finished by a MacBook Pro (Intel Core Duo 2.16 GHz) running Darwin 9.8.0:
                        u series   v series
Completed                   6          1
Error while computing      11          9
Totals                     17         10
Only one was for year 599, and it was a completed u series task.
Two v series tasks in progress have been excluded, as have two v series ghosts, which show as "in progress" due to a resetting of CPDN.
Keith |
|
|
|
|
|
Famous_uiav_599_200_006656562_1 completed on Core2Quad Q6600 @2.4GHz Windows XP Home.
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11533655 |
|
|
|
|
|
Famous_v0ta_1799_200_006686828_3 failed. Reason: "Invalid Theta". OS is Windows 7 32 bit running on a Core 2 Duo 1.5 GHz processor with 2 GB of RAM. |
|
|
|
|
|
Famous_ubr6_1799_200_006648077_5 failed. Reason invalid theta. OS is Windows 7 64 bit SP1 beta running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
famous_ue4u_799_200_006651161_6 failed at 84%, win7-intel. Nothing unusual about the temperature chart. |
|
|
|
|
|
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11696619
BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 60 - Return code = 16
BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 61 - Return code = 16
BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 68 - Return code = 16
BUFFIN: Read Failed: No such file or directory
BUFFIN: C I/O Error feof - Unit 69 - Return code = 16
____________
|
|
|
|
|
|
The BUFFIN errors happen when a FAMOUS task is removed from memory between generating a trickle and the next checkpoint. The task is restarted from the checkpoint before the trickle, and the error is generated when a second attempt is made to post-process the data for the previous year. The errors are harmless.
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer |
|
|
|
|
|
Famous_uiau_1999_200_006656561_2 completed on Core2quad Q6600 Windows XP Home.
Workunit error - check skipped.
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11533651
|
|
|
mo.v (Forum moderator)
|
|
This 'Workunit error - check skipped' message means nothing for CPDN because our models aren't validated in the same way as tasks from other projects. It's a confounded nuisance and must put some people off. I don't know whether Milo could get rid of it.
Similarly, on your model's workunit page we see 'Too many total results' which is another irrelevant message.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
John
The only messages relevant to climate models are found in stderr, which is on the main model page.
As there aren't any error messages, and it says further up the page: Over Success Done, that particular model is just that; a success.
____________
Backups: Here |
|
|
|
|
|
My latest stats:
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.10
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Completed UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.10
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
Error while computing UK Met Office FAMOUS v6.11
63 completed, 35 errors, not counting the phantoms.
____________

Forum search Site search |
|
|
mo.v (Forum moderator)
|
|
Thank you for the results of such a large number of models. Superficially this appears to mean a success rate of about 64% and a failure rate of about 36%. However, as the failures take less time to run because they crash before the end, the failure rate must be lower (if we mean the probability that any model will complete or fail).
I'm not sure how to calculate this.
Ideally the calculation would need to take into account whether on average the crashes occur at 50% completion (i.e. are equally likely to happen at any processing moment). I don't know this.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
Maybe if you incorporate the CPU-time a better idea of failure rate can be gotten.
Looking at the stats for the first host of [B^S] mavau's list, the total CPU time spent on FAMOUS models comes to approximately 61 million seconds (60785865.05). About 48 million of those seconds (48398651.3) were spent on successfully completed models. Maybe it is fair to say that makes for an 80% success rate for that particular host? Those numbers are based on 87 models (55 completed, 32 failed).
Or would you have to take into account the time that would have been spent if all models had completed successfully? In that case you get a success rate of about 63%. |
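The two ways of measuring the success rate in this post can be worked through in a few lines of Python. The figures are the ones quoted above; the function names `rate_by_count` and `rate_by_cpu_time` are hypothetical. The sketch reads "(55/32)" as 55 completed and 32 failed, and assumes every model would need the same CPU time to complete, in which case the second measure reduces to the plain count-based rate.

```python
# Figures quoted in the post above for one host.
total_cpu_seconds = 60785865.05
success_cpu_seconds = 48398651.3
completed, failed = 55, 32


def rate_by_count(completed, failed):
    """Share of finished models that completed successfully."""
    return completed / (completed + failed)


def rate_by_cpu_time(success_seconds, total_seconds):
    """Share of CPU time that was spent on successfully completed models."""
    return success_seconds / total_seconds


print(f"by count:    {rate_by_count(completed, failed):.1%}")
print(f"by CPU time: {rate_by_cpu_time(success_cpu_seconds, total_cpu_seconds):.1%}")
```

The CPU-time measure comes out higher because crashed models stop early and so account for a smaller share of the total time than of the total count.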
|
|
|
|
|
I've been through my results in more detail.
First, some of my errors were successes on other platforms.
Error while computing Darwin success
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6923876
Error while computing Linux success
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6922152
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6919805
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6867265
Error while computing Darwin and Linux success
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6869834
Error while computing Windows 7 64-bit AMD success
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6870035
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6868920
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6867868
Error while computing XP AMD success
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6868473
And here is a look at my completed models failing on other platforms/combinations.
This is not a full list. I've tried to exclude computers with constant failures, immediate failures...
I haven't checked every failure. I've noted a few disk errors on Windows 7 I hadn't seen before towards the end of the list.
Note the large number of invalid thetas on Darwin
Linux AMD
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6837149
Linux Xeon
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6870087
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6870008
Darwin
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6867021
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6865840
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6889698
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6868148
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6868668
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6918367
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6918280
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6894568
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6922925
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6895570
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6893598
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6918731
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6918391
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6921744
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6920379
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6935286
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6938524
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6940555
Linux and Darwin
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6869410
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6919132
XP AMD
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6865756
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6889575
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6868483
XP Intel
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6865602
Windows 7 AMD Vista AMD
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6837049
Server 2003 AMD Linux Intel
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6868542
Windows 7 64 AMD Darwin
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6921166
Windows 7 64 Intel disk error
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6918496
Windows 7 64 Intel disk error and Darwin
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6921700
XP AMD Darwin
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6921012
____________

Forum search Site search |
|
|
|
|
|
An early failure I'd missed (application doesn't show in the right column).
Windows machines fail at the same point, Linux a little bit later, and Darwin succeeds.
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/workunit.php?wuid=6835898
____________

Forum search Site search |
|
|
|
|
|
The reason for the larger proportion of Darwin failures is that the compiler used couldn't be set to avoid SSE2 on Macs.
So, while Windows and Linux were eventually set to not use SSE2, and therefore be more stable, (but slower), Macs weren't.
(All of this was during testing on the beta site.)
Statistics is beset with problems. 
____________
Backups: Here |
|
|
|
|
|
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12006782
famous_w9iy_599_200_006759034_0
FAIL cold temperature °C
____________
|
|
|
mo.v (Forum moderator)
|
|
Yes, what a cold graph.
____________
Cpdn news
5 CPDN READMEs |
|
|
|
|
|
I don’t think that I have ever seen a graph like that before. That is more than just a cold snap. It is more a glacial age. It looks more like the entire Earth was entering a snowball phase like what geologists think happened about 700 million years ago.
____________
|
|
|
|
|
|
Famous_v1eo_999_200_00672929-1 failed at about 56 years. Reason: invalid theta. OS is Windows 7 64 bit SP1 RC running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
____________
|
|
|
|
|
|
Here's one model I'm curious about (getting very cold).
Let's see how it develops:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12000683
Another note: this w series seems to run much slower (.55 v .48 on my machine).
____________

|
|
|
|
|
|
From a post by Hiro on the beta site:
On the main site, we have just started the famous_w series of experiments using the same version of Famous.
The initial workunits are spin-up runs with a wider range of parameters, including a new parameter for the number of dynamic sweeps. Actually, we perturbed the sweep parameter before, but only for a few work units.
and later:
To add a bit of background, we started using 2-sweep dynamics to stabilize the model. This effectively cuts the time step of the atmospheric _dynamics_ by half. However, the run speed hardly increases because the atmospheric dynamics (excluding what we call "physics" and radiative transfer) is very small in terms of CPU time.
According to my 5 or so cluster runs for the millennium and some results from the Bristol group, this eliminates most of the cold crashes (still not perfect, though).
I think that the 2nd post also refers to the "w" series models.
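As a rough illustration of why doubling the dynamics sweeps barely changes the run speed, here is a minimal Python sketch. The 5% dynamics share is a made-up figure for illustration only (the quoted post says just that the share is very small), and the function name is hypothetical.

```python
def relative_run_time(dynamics_share, sweeps):
    """Run time relative to a 1-sweep run, assuming only the dynamics cost scales with sweeps."""
    if not 0.0 <= dynamics_share <= 1.0:
        raise ValueError("dynamics_share must be between 0 and 1")
    # Physics and radiation cost stays fixed; the dynamics cost is multiplied by the sweep count.
    return (1.0 - dynamics_share) + dynamics_share * sweeps

# Hypothetical share: dynamics takes 5% of the CPU time per model time step.
share = 0.05
print(relative_run_time(share, 2))  # about 1.05, i.e. roughly 5% slower
```

Under that assumption, halving the dynamics time step costs only a few percent, which fits the remark that the run speed hardly increases.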
____________
|
|
|
|
|
|
I had missed this cold failure:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=11997733
And this cold model is still running:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12000683
____________

|
|
|
|
|
|
That second one took some time dying. Very cold.
____________

|
|
|
|
|
|
Another cold failure:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12234436 |
|
|
|
|
|
Famous_ kHz_1999_200_006712331_3 completed successfully. OS is Windows 7 64 bit SP1 RC running on a Core 2 Duo 2.2 GHz processor with 4 GB of RAM.
A very warm one. Average temp. rising steadily throughout the 21st and 22nd centuries from 17.2 to 22.9 degrees C. The rise is greatest in the Northern Hemisphere, where it tops out at 24.4 C. Solar constant is at default.
____________
|
|
|
|
|
|
In the famous_w0xx_599 series, I've had two fail and one succeed, so far.
One of the failures was a runaway, reaching 38.5 Celsius before crashing. The other was a cold world, crashing at 8.7 C.
The one that succeeded had quite extreme-looking values for ice fall speed, entrainment coefficient, and temp range of ice albedo variation. You just can't tell.
Back on "v series" famouses now - the luck of the draw. |
|
|
|
|
|
Overall: 130 successes, 73 failures while computing (64% success ratio)
w-series: 2 successes, 12 failures while computing (14% success ratio)

Machine                     Success          Computing failure
                            (All/w-series)   (All/w-series)
Core i7 920 Linux           52/0             30/2
Phenom II X4 940 Linux      50/0             27/6
Phenom II X6 1090T Linux    12/2             7/1
Phenom II X2 B93 Windows    10/0             5/1
Core2 E8600 Windows         6/0              4/2 |
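The overall ratios in this tally can be re-derived from the per-machine counts. A minimal Python sketch follows, with the counts copied from the post; the dictionary layout and the function name are illustrative only and not part of any CPDN tool.

```python
# Per-machine counts copied from the post:
# (success_all, success_w_series, failure_all, failure_w_series).
# The data layout and names are illustrative assumptions.
counts = {
    "Core i7 920 Linux": (52, 0, 30, 2),
    "Phenom II X4 940 Linux": (50, 0, 27, 6),
    "Phenom II X6 1090T Linux": (12, 2, 7, 1),
    "Phenom II X2 B93 Windows": (10, 0, 5, 1),
    "Core2 E8600 Windows": (6, 0, 4, 2),
}

def success_ratio(successes, failures):
    """Return successes as a percentage of all finished models."""
    total = successes + failures
    return 100.0 * successes / total if total else 0.0

# Sum each column across machines.
all_ok = sum(c[0] for c in counts.values())
w_ok = sum(c[1] for c in counts.values())
all_fail = sum(c[2] for c in counts.values())
w_fail = sum(c[3] for c in counts.values())

print(f"Overall: {all_ok} success, {all_fail} failures ({success_ratio(all_ok, all_fail):.0f}%)")
print(f"w-series: {w_ok} success, {w_fail} failures ({success_ratio(w_ok, w_fail):.0f}%)")
```

The column sums come to 130/73 overall and 2/12 for the w-series, matching the 64% and 14% figures quoted in the post.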
|
|
|
|
|
Had a big crash (6 models) 12 days ago, due to disk issues (bad sectors).
Early symptom: McAfee check took ages to complete.
I eventually noticed all the disk error messages in Event Viewer.
This solution should work for some time:
Have chkdsk identify the bad sectors once in a while (second checkbox).
The new batch has been successful, except for:
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=12400747
I hadn't met the error before.
Some links below for everybody's information:
http://ncas-cms.nerc.ac.uk/trac/UMHelpdesk/ticket/399
http://cpdnbeta.oerc.ox.ac.uk/forum_thread.php?id=229
Happy crunching for 2011.
____________

|
|
|
|
|
|
I've passed this on to the project person for FAMOUS.
____________
|
|
|
|
|
|
Mavau, and others who get a "Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH" error message:
This is caused by one of the auxiliary files not having sufficient data to cover the full modelling period.
Those data sets still in the queue that were affected have now been removed.
Apologies for the mix up.
Also, Kraken has filled up yet again, and data needs to be moved off to storage.
Milo is not available to do this, so we wait in hope. :)
____________
|
|
|
|
|
|
I have a MacBook Air (1114220) that's been crunching since the middle of December. It's received nothing but FAMOUS models, and has completed 6 successfully out of 48 downloaded. I haven't checked them all, but the ones I have looked at state "INVALID THETA DETECTED". My impression from a search of the boards is that Famous models are fairly prone to failure, but the percentage on this machine seems way too high.
My question is: is there a way to conveniently exclude this machine from receiving Famous models, and if there is, should I do so?
Of the 6 PCs I have on the project, this one is far and away the most "productive" when measured by credits received: over the last 5 days it's averaged 2,308.42 credits (1,154.21 per CPU core). My 2 quad-core Windows machines averaged 550 and 599 per core over the same period. The other two Core2 Duo machines (both Windows boxes) managed slightly less.
____________
Derrick Ashby |
|
|
|
|
|
Yes, you can exclude the Air from getting Famouses.
To do so:
(1) Go into "Your account" - see the blue menu on the left.
(2) Scroll down to "computers", go into this, and then into "Details" for the Air. Set the Air to be in a different 'Location' from your other computers -- say, School.
(3) Back on the "Your account" page, go into "climateprediction.net preferences". Find the link for "Add preferences for School", and in there, select the applications that you want to allow, and de-select "accept work from other applications?"
------------
The reason for high daily credit and famouses failing so frequently on Macs is that the CPDN programmers could not get the Famous application to compile without extra optimizations. The result is that Famouses run very fast but also crash more often on Macs than on other platforms.
HTH |
|
|
|
|
|
Unfortunately Famous is currently the only model type available for Mac and Linux, so if you exclude Famous you will get no work at all. |
|
|
|
|
|
64-bit Linux on dual-core Intel:
10 errored out altogether, 2 probably due to reboot issues. 2 are 599 models, which are known to be more prone to crashing.
u4pe1999 74gg999 ugyf1799 v3pc1899 vhcx1199 vizg1599 vizh1799 w56v599 w8y4599 w158599
Some failed with invalid theta, the rest with negative pressure values.
Completed:
v3cz1799 v1b01799 va9x1799 ubdw1999 uh8d1799, which makes 5, or 1/3, completed.
On my partner's box (WinXP, AMD), vnt18199, the only FAMOUS unit started, completed.
|
|
|