climateprediction.net home page

The world's largest climate forecasting experiment for the 21st century.

Intel Visual Fortran run-time error


boiner_george
Joined: Jan 29 12
Posts: 2
Credit: 295,995
RAC: 3
Message 45832 - Posted 7 Apr 2013 16:45:20 UTC

    Receiving a run-time error after increasing disk space for this hard-disk-eating hog ... added another 2 gig.

    As far as I can tell, that is the only change I've made to BOINC stuff in the last couple of weeks ... or for that matter to my machine ... other than loading the latest version of Java ... and now I get the following.

    forrtl: severe (19) invalid reference to variable in NAMELIST

    C:\ProgramData\BOINC\projects\climateprediction.net\hadcm3n_zg88_1920 ....\climate.cpdc line 528, position 8.

    .... stack trace terminate abnormally.

    Anybody out there got a clue?

    Running Pentium i7-2600K CPU 3.5GHz, with 16 gig RAM, NVIDIA 690 video card, Windows 7 64 bit Operating System ... tons of hard disk

    astroWX
    Forum moderator
    Joined: Aug 5 04
    Posts: 1250
    Credit: 34,995,599
    RAC: 23,022
    Message 45834 - Posted 7 Apr 2013 18:58:40 UTC

      Last modified: 7 Apr 2013 19:10:28 UTC

      I had a potload of them yesterday, on different machines. Each one threw six Fortran error popups, then crashed. No pattern was noticed in the Task names but, given that it was consistent across seven Intel quads from Q6600 to i5 3550, with OSs from XP_x64 to W7_x64, I chalk it up to a problem with a large chunk of the few thousand Tasks released recently. All failed to start. Work units for those with a "history" showed the same problem.

      CPDN's Data file "growth" comes from the inability of CPDN to clean-up after itself after abnormal endings. Frustrating, isn't it? (I've been remiss in cleaning-up after failures for a long time and have Data files ranging up to a ridiculous 16Meg...)

      Edit: The link in my footer no longer works: It hasn't been updated because I have hope (probably vain) that our original board will be resurrected.
      ____________
      "We have met the enemy and he is us." -- Pogo
      Greetings from coastal Washington state, the scenic US Pacific Northwest.

      Arn
      Joined: Nov 28 07
      Posts: 1
      Credit: 509,752
      RAC: 119
      Message 45836 - Posted 7 Apr 2013 20:43:54 UTC

        I've been receiving the Intel Visual Fortran run-time error continuously for the second day now, but the error reads somewhat differently:

        forrtl: severe (19): invalid reference to variable in NAMELIST input, unit 5, file
        C:\ProgramData\BOINC\projects\climateprediction.net\hadcm3n_4cr;9_1980_40_008348863\jobs\climate.cpdc, line 529, position 0

        Image PC Routine Line Source
        hadcm3n_um_6.07_w 007D9D2A Unknown Unknown Unknown
        hadcm3n_um_6.07_w 00780B60 Unknown Unknown Unknown
        hadcm3n_um_6.07_w 0077FD3A Unknown Unknown Unknown
        hadcm3n_um_6.07_w 007648D4 Unknown Unknown Unknown
        hadcm3n_um_6.07_w 0063744C Unknown Unknown Unknown
        hadcm3n_um_6.07_w 0054C606 Unknown Unknown Unknown
        hadcm3n_um_6.07_w 0054E1A9 Unknown Unknown Unknown
        hadcm3n_um_6.07_w 006FE53B Unknown Unknown Unknown
        hadcm3n_um_6.07_w 006F3667 Unknown Unknown Unknown
        hadcm3n_um_6.07_w 004083F3 Unknown Unknown Unknown
        hadcm3n_um_6.07_w 00408130 Unknown Unknown Unknown
        kernel32.dll 773DD2E9 Unknown Unknown Unknown
        ntdll.dll 77BB1603 Unknown Unknown Unknown
        ntdll.dll 77BB15D6 Unknown Unknown Unknown

        I have ended work for Climate Prediction until I am assured no damage will result from this error. I googled this, and the very first result stated that 'severe' errors must be corrected.

        Any knowledgeable assistance will be appreciated. Thanks.

        tcpk22

        Lockleys
        Joined: Jan 13 07
        Posts: 118
        Credit: 3,770,090
        RAC: 1,973
        Message 45837 - Posted 7 Apr 2013 21:13:06 UTC

          I have just experienced a similar message set to Arn for task hadcm3n_3l4z_1980_40_008349369_2 .

          I have aborted it.

          Les Bayliss
          Forum moderator
          Joined: Sep 5 04
          Posts: 5129
          Credit: 8,459,347
          RAC: 5,837
          Message 45838 - Posted 7 Apr 2013 21:27:57 UTC

            Arn

            All "severe" means is that the error will most likely be fatal TO THE COMPUTER PROGRAM THAT HAS HAD THIS. i.e. the climate model.
            It doesn't mean that your computer will explode, or that your teeth will turn green and your hair fall out.

            Les Bayliss
            Forum moderator
            Joined: Sep 5 04
            Posts: 5129
            Credit: 8,459,347
            RAC: 5,837
            Message 45839 - Posted 7 Apr 2013 21:30:11 UTC

              I've had a PM about this error, as well as those reported here, so I'll let the project people know.

              Ironworker16
              Joined: Jul 15 05
              Posts: 1
              Credit: 358,742
              RAC: 0
              Message 45840 - Posted 7 Apr 2013 23:02:43 UTC - in response to Message 45832.

                Last modified: 7 Apr 2013 23:03:41 UTC

                I have the same error here also. I'm including the error text and stderr.txt from one work unit. I'm going to suspend the project until there is a resolution.

                ---------------------------
                Intel(r) Visual Fortran run-time error
                ---------------------------
                forrtl: severe (19): invalid reference to variable in NAMELIST input, unit 5, file C:\ProgramData\BOINC\projects\climateprediction.net\hadcm3n_4f8c_2020_40_008348911\jobs\climate.cpdc, line 529, position 0

                Image PC Routine Line Source
                hadcm3n_um_6.07_w 007D9D2A Unknown Unknown Unknown
                hadcm3n_um_6.07_w 00780B60 Unknown Unknown Unknown
                hadcm3n_um_6.07_w 0077FD3A Unknown Unknown Unknown
                hadcm3n_um_6.07_w 007648D4 Unknown Unknown Unknown
                hadcm3n_um_6.07_w 0063744C Unknown Unknown Unknown
                hadcm3n_um_6.07_w 0054C606 Unknown Unknown Unknown
                hadcm3n_um_6.07_w 0054E1A9 Unknown Unknown Unknown
                hadcm3n_um_6.07_w 006FE53B Unknown Unknown Unknown
                hadcm3n_um_6.07_w 006FE53B Unknown Unknown Unknown
                hadcm3n_um_6.07_w 006F3667 Unknown Unknown Unknown
                hadcm3n_um_6.07_w 004083F3 Unknown Unknown Unknown
                hadcm3n_um_6.07_w 00733DBD Unknown Unknown Unknown
                ntdll.dll 772C04C0 Unknown Unknown Unknown
                ntdll.dll 772C0B1F Unknown Unknown Unknown
                ntdll.dll 772C0D5A Unknown Unknown Unknown
                ntdll.dll 772C0D5A Unknown Unknown Unknown
                ntdll.dll 772C2E92 Unknown Unknown Unknown
                ntdll.dll 772C2ED2 Unknown Unknown Unknown
                hadcm3n_um_6.07_w 007CCCEA Unknown Unknown Unknown
                ntdll.dll 772BF683 Unknown Unknown Unknown

                ---------------------------
                OK
                ---------------------------


                stderr.txt - Notepad

                04:25:22 (76312): No heartbeat from core client for 30 sec - exiting
                CPDN Monitor - No 'heartbeat' from BOINC...
                Suspended CPDN Monitor - Suspend request from BOINC...
                Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=92788, iMonCtr=1
                Model crash detected, will try to restart...
                Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=92788, iMonCtr=1
                Model crash detected, will try to restart...
                Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=92788, iMonCtr=1
                Model crash detected, will try to restart...
                Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=92788, iMonCtr=1
                Model crash detected, will try to restart...
                Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=92788, iMonCtr=1
                Model crash detected, will try to restart...
                Suspended CPDN Monitor - Suspend request from BOINC...
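
A note for readers comparing logs: the repeated "Model crash detected, will try to restart..." lines in a stderr.txt like the one above can be counted with a short script. This is only a minimal sketch. The function names and the threshold of 3 are hypothetical choices made for illustration, not part of BOINC or the CPDN software, and the sample text is copied from the excerpt above.

```python
# Two-line pattern copied from the stderr.txt excerpt above, where it appears five times.
RESTART_PAIR = (
    "Controller:: CPDN process is not running, exiting, bRetVal = 1, "
    "checkPID=0, selfPID=92788, iMonCtr=1\n"
    "Model crash detected, will try to restart...\n"
)

SAMPLE_STDERR = (
    "04:25:22 (76312): No heartbeat from core client for 30 sec - exiting\n"
    "CPDN Monitor - No 'heartbeat' from BOINC...\n"
    "Suspended CPDN Monitor - Suspend request from BOINC...\n"
    + RESTART_PAIR * 5
    + "Suspended CPDN Monitor - Suspend request from BOINC...\n"
)

RESTART_MARKER = "Model crash detected"


def count_restart_attempts(stderr_text: str) -> int:
    """Count the lines that report a detected model crash and a restart attempt."""
    return sum(
        1
        for line in stderr_text.splitlines()
        if line.strip().startswith(RESTART_MARKER)
    )


def looks_like_restart_loop(stderr_text: str, threshold: int = 3) -> bool:
    """Hypothetical rule of thumb: several restart attempts suggest a model that cannot start."""
    return count_restart_attempts(stderr_text) >= threshold


if __name__ == "__main__":
    print(count_restart_attempts(SAMPLE_STDERR))
    print(looks_like_restart_loop(SAMPLE_STDERR))
```

To check a real file, read its contents into a string and pass that to `count_restart_attempts`. A high count only hints that the model keeps failing to start; the graphics window and progress figure remain the checks suggested in this thread.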



                Running core i7-920 CPU , with 12 gig RAM, Radeon HD 7970 video card, Windows 8 64 bit Operating System ... tons of hard disk

                mo.v
                Forum moderator
                Joined: Sep 29 04
                Posts: 2354
                Credit: 6,492,442
                RAC: 2,144
                Message 45841 - Posted 8 Apr 2013 1:15:55 UTC

                  Last modified: 8 Apr 2013 1:17:07 UTC

                  Thanks to everyone for your reports. The reason the errors say Visual Fortran is that this is the language the climate models are written in. Here is a list of Fortran Run-Time error codes with very brief descriptions of their meanings.

                  I had downloaded three new models yesterday, Sunday, but they hadn't begun to run. So I suspended some models already running to make the new ones start. Here's what happened:

                  Within seconds of starting each of the three models threw a Visual Fortran Runtime error just like the ones members have already quoted. Two models starting in 1980 said the error was in line 529 in position 0, whereas the model starting in 1920 said line 528 in position 8.

                  I left the models running and opened Windows Event Viewer to see whether the three runtime errors were recorded there. I could find no trace of these errors either by name or by timestamp. They appeared to have had no effect on the running of the computer.

                  I then looked at the Fortran error page again and noticed that 'with severe, program execution stops (unless a recovery method is specified)'. My models still seemed to be running in the sense that they were still clocking up time. I opened the graphics window for each of them to see how they were advancing and found that all three were stopped at timestep No 1 and showed completely blue globes. Blue is the default colour and means that computation never started.

                  I checked the Performance tab in Windows Task Manager to see whether these models were using CPU time (and energy/electricity) and found that they were idle, i.e. costing no energy.

                  As these models are not advancing I'm going to abort them and get new ones. But if the new ones belong to the same batch they will probably throw the same error.

                  Visual Fortran Runtime errors have never in the past done any harm to our computers. As Les has said, this error is restricted to the model in question. It looks scary because of the cross in the red circle but is harmless to everything except the models. Look at the graphics to see whether they're really processing and if they're not, please abort them.
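
For anyone collecting these reports, the task name, line and position can be pulled out of a forrtl message with a short script. This is a minimal sketch that assumes the message layout quoted earlier in this thread. The class and function names are hypothetical, and reading the third underscore-separated field of the task name as the model's start year is an inference from the task names in this thread, not documented behaviour.

```python
import re
from typing import NamedTuple, Optional


class NamelistError(NamedTuple):
    code: int        # forrtl error number, e.g. 19
    task: str        # task directory name under the project folder
    start_year: str  # assumed: third underscore-separated field of the task name
    line: int        # line in climate.cpdc reported by the run-time
    position: int    # position on that line reported by the run-time


# Layout assumed from the messages quoted in this thread.
PATTERN = re.compile(
    r"forrtl: severe \((?P<code>\d+)\):.*?"
    r"climateprediction\.net\\(?P<task>[^\\]+)\\jobs\\climate\.cpdc,\s*"
    r"line (?P<line>\d+), position (?P<pos>\d+)",
    re.DOTALL,
)

# Message copied from one of the reports in this thread.
SAMPLE = (
    r"forrtl: severe (19): invalid reference to variable in NAMELIST input, "
    r"unit 5, file C:\ProgramData\BOINC\projects\climateprediction.net"
    r"\hadcm3n_4f8c_2020_40_008348911\jobs\climate.cpdc, line 529, position 0"
)


def parse_forrtl_message(text: str) -> Optional[NamelistError]:
    """Return the parsed fields, or None if the text does not match the assumed layout."""
    match = PATTERN.search(text)
    if match is None:
        return None
    task = match.group("task")
    fields = task.split("_")
    start_year = fields[2] if len(fields) > 2 else ""
    return NamelistError(
        code=int(match.group("code")),
        task=task,
        start_year=start_year,
        line=int(match.group("line")),
        position=int(match.group("pos")),
    )


if __name__ == "__main__":
    print(parse_forrtl_message(SAMPLE))
```

Grouping the parsed results by start year would make a pattern like the one described above (line 529 for 1980 starts, line 528 for the 1920 start) easy to spot across many reports.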


                  ____________
                  Cpdn news
                  5 CPDN READMEs

                  Les Bayliss
                  Forum moderator
                  Joined: Sep 5 04
                  Posts: 5129
                  Credit: 8,459,347
                  RAC: 5,837
                  Message 45842 - Posted 8 Apr 2013 16:15:13 UTC

                    OK, the problem has been traced to an incorrect line, (1 of hundreds), in one of the many files that go to make up data sets to start these models.
                    This has been fixed, and the faulty data sets will be re-issued.

                    Thank goodness people buy cars assembled, and don't get dozens of boxes of various shapes and sizes with parts that they then have to assemble themselves. With the instructions, no doubt, in the language of origin of the parts makers.
                    :)


                    zombie67 [MM]
                    Joined: Oct 2 06
                    Posts: 2
                    Credit: 2,295,429
                    RAC: 1
                    Message 45845 - Posted 8 Apr 2013 22:48:37 UTC

                      I received several of these too. Will the bad tasks be aborted server-side?
                      ____________

                      Les Bayliss
                      Forum moderator
                      Joined: Sep 5 04
                      Posts: 5129
                      Credit: 8,459,347
                      RAC: 5,837
                      Message 45846 - Posted 9 Apr 2013 0:05:50 UTC - in response to Message 45845.

                        Bad tasks on Macs and Linux should self abort very quickly.
                        On Windows it may be a different matter. It's possible they may sit there pretending to run but not clocking up any progress in the various lines in the Show Graphics window. We're still talking about this. (Very slowly, due to time zone differences, and the loss of our php board.)
                        My 2 are from December so they aren't affected, and I have to go by second hand information.


                        mo.v
                        Forum moderator
                        Joined: Sep 29 04
                        Posts: 2354
                        Credit: 6,492,442
                        RAC: 2,144
                        Message 45847 - Posted 9 Apr 2013 0:10:08 UTC

                          Last modified: 9 Apr 2013 0:11:30 UTC

                          Hi Zombie

                          To my knowledge, tasks already sent to computers won't be aborted from the server. This was done once before but the killer message was sent from the server to the computer when the model's next trickle was uploaded. But AFAIK this can't be done with the current models because although they're accumulating runtime they are making no progress and will never reach the end of their first year which is when they would normally trickle up and make contact with the server.

                          I get the impression from looking at a lot of these models' task and WU web pages that on Darwin and Linux many of the models crash of their own accord. They don't all crash on Windows. On my own Windows machine three of these models accumulated runtime for well over an hour without making progress, using CPU time or crashing. Other longer periods have been reported in this thread.

                          I think a lot of these models are still stuck on computers. Not using electricity but hogging CPU cores that could be crunching usefully. Please abort them. I know this is tedious for members who have a lot of computers.

                          I see Les got there first but I'll leave my comments anyway
                          ____________
                          Cpdn news
                          5 CPDN READMEs

                          zombie67 [MM]
                          Joined: Oct 2 06
                          Posts: 2
                          Credit: 2,295,429
                          RAC: 1
                          Message 45848 - Posted 9 Apr 2013 2:40:03 UTC

                            Yes, I am talking about windows machines here.

                            But the bad tasks should be aborted from the server-side, all the same. The machine will likely contact the server to fill a different thread slot, and would then learn to kill the task.

                            There is no reason to *not* kill those bad tasks from the server side:

                            If *nix: They die anyway
                            if Win: They need to be killed anyway.
                            ____________

                            Les Bayliss
                            Forum moderator
                            Joined: Sep 5 04
                            Posts: 5129
                            Credit: 8,459,347
                            RAC: 5,837
                            Message 45849 - Posted 9 Apr 2013 3:07:19 UTC - in response to Message 45848.

                              For the "killer trickle" to be sent to the correct target, that target, i.e. climate model, needs to return a trickle_up file for the server to find it.
                              As has been said, this is unlikely to happen, so they CAN'T be killed from the server.
                              As has also been said, we're still talking about this, but it'll be a few hours yet before the Oxford people are back at work to get the latest messages that have been sent to them.


                              JIM
                              Joined: Dec 31 07
                              Posts: 609
                              Credit: 3,342,044
                              RAC: 4,746
                              Message 45850 - Posted 9 Apr 2013 7:12:47 UTC - in response to Message 45842.

                                Thank goodness people buy cars assembled, and don't get dozens of boxes of various shapes and sizes with parts that they then have to assemble themselves. With the instructions, no doubt, in the language of origin of the parts makers.

                                Strangely, while you don't buy cars that way, you can buy airplanes that way. People buy disassembled kits that they have to put together themselves. Then they get in and fly them. Frightening, isn't it?

                                ____________

                                Dave Jackson
                                Joined: May 15 09
                                Posts: 605
                                Credit: 581,731
                                RAC: 157
                                Message 45851 - Posted 9 Apr 2013 7:31:09 UTC - in response to Message 45850.

                                  Wouldn't know about the instructions bit - I only rtfm when something doesn't work.

                                  Ingleside
                                  Joined: Aug 5 04
                                  Posts: 85
                                  Credit: 6,840,914
                                  RAC: 11,870
                                  Message 45916 - Posted 13 Apr 2013 1:07:15 UTC - in response to Message 45849.

                                    For the "killer trickle" to be sent to the correct target, that target, i.e. climate model, needs to return a trickle_up file for the server to find it.
                                    As has been said, this is unlikely to happen, so they CAN'T be killed from the server.

                                    Aborting tasks without relying on trickle-messages has been part of BOINC since around BOINC-Client v5.10.x.

                                    MichaelO
                                    Joined: Aug 8 05
                                    Posts: 5
                                    Credit: 7,573,387
                                    RAC: 10,065
                                    Message 45949 - Posted 16 Apr 2013 20:22:35 UTC

                                      Great discussion...I was concerned I was doing something wrong.

                                      However, after aborting tasks behaving like those described, one machine I have has not received any further tasks. Is this likely an unrelated issue? I.e., could aborting the tasks with errors 'flag' my machine so the server now ignores it?
                                      ____________

                                      Les Bayliss
                                      Forum moderator
                                      Joined: Sep 5 04
                                      Posts: 5129
                                      Credit: 8,459,347
                                      RAC: 5,837
                                      Message 45950 - Posted 16 Apr 2013 20:49:46 UTC - in response to Message 45949.

                                        This project often has long periods of no work. This is one of them.
                                        There was a small batch of these models released to test the MD5 problem, but that may be it for a while.

                                        See the Server Status page for what's available. Blue menu to the left, 5 from the bottom.

                                        Pete(r) van der Spoel
                                        Joined: Aug 5 04
                                        Posts: 6
                                        Credit: 3,176,969
                                        RAC: 2,332
                                        Message 45979 - Posted 19 Apr 2013 14:09:16 UTC - in response to Message 45842.

                                          OK, the problem has been traced to an incorrect line, (1 of hundreds), in one of the many files that go to make up data sets to start these models.
                                          This has been fixed, and the faulty data sets will be re-issued.


                                          Does this happen automatically or do I need to abort the tasks? I've been getting these errors since yesterday but the progress % keeps creeping up and the graphics confirm that the tasks still seem to be progressing (colour pattern changes).

                                          ____________

                                          Pete(r) van der Spoel
                                          Joined: Aug 5 04
                                          Posts: 6
                                          Credit: 3,176,969
                                          RAC: 2,332
                                          Message 45980 - Posted 19 Apr 2013 14:13:36 UTC - in response to Message 45979.

                                            Sorry, my bad for not looking properly. The errors were all about the task I'd just downloaded, which was actually stuck at 0%. The others are running fine, so I'll just abort that one task...
                                            ____________

                                            mo.v
                                            Forum moderator
                                            Joined: Sep 29 04
                                            Posts: 2354
                                            Credit: 6,492,442
                                            RAC: 2,144
                                            Message 46002 - Posted 21 Apr 2013 0:37:25 UTC

                                              Pete, if the model's progress is stuck at 0% please abort it.
                                              ____________
                                              Cpdn news
                                              5 CPDN READMEs

                                              Stuart
                                              Joined: Jan 2 11
                                              Posts: 4
                                              Credit: 166,701
                                              RAC: 0
                                              Message 46119 - Posted 29 Apr 2013 21:51:43 UTC

                                                Hello,

                                                Same error here. It has been cropping up over the last few weeks - I had been aborting "bad" simulations, but now it's happening more often.

                                                I don't appear to be able to copy and paste the error message, and it's a lot to type!

                                                Just aborted another task which was showing "computation error", task hadcm3n_3j00_1980_40_008352515

                                                It was reported as having had 9h 55m 10s computation time which on my PC is around 2% completion.

                                                I've not really looked in detail to see when the others failed, in case there is a pattern.

                                                Re-installed Boinc 7.0.64 for windows 64.

                                                I run Windows 7 ultimate, 18Gb ram, i7 930 2.8Ghz which has been year long stable at 3.36GHz with an nVidia GTX 570 graphics card.

                                                I only run the climate prediction and also GPU grid on boinc.

                                                Hope this helps somebody fix things?

                                                Stuart

                                                mo.v
                                                Forum moderator
                                                Joined: Sep 29 04
                                                Posts: 2354
                                                Credit: 6,492,442
                                                RAC: 2,144
                                                Message 46122 - Posted 29 Apr 2013 22:27:05 UTC

                                                  Last modified: 29 Apr 2013 22:28:07 UTC

                                                  Hi Stuart

                                                  Thank you for the report. The model certainly didn't fail because of a shortage of RAM on your computer, did it?

                                                  I hate to have to tell you that model hadcm3n_3bzy_1980_40_008349731 on your computer will also have to be aborted. If you see that any model in the same workunit has crashed with Exit status -529697949, please abort it straightaway.
                                                  ____________
                                                  Cpdn news
                                                  5 CPDN READMEs

                                                  Les Bayliss
                                                  Forum moderator
                                                  Joined: Sep 5 04
                                                  Posts: 5129
                                                  Credit: 8,459,347
                                                  RAC: 5,837
                                                  Message 46135 - Posted 30 Apr 2013 20:26:22 UTC

                                                    It's thought that the source of the FORTRAN errors has been found, so a small test batch was released.
                                                    These were grabbed immediately, and are apparently running OK.


                                                    ____________
                                                    Backups: Here

                                                    Ron Schroeder
                                                    Joined: May 5 07
                                                    Posts: 1
                                                    Credit: 1,935,473
                                                    RAC: 2,549
                                                    Message 46245 - Posted 16 May 2013 21:02:12 UTC

                                                      I have a Fortran error running

                                                      hadcm3n_4db9_1980_40_008348264_2
                                                      on dual quad-core Xeon processors, Windows 7 x64, all patched etc. The other 7 Projects are running without problems. I aborted this one.

                                                      MikeMarsUK
                                                      Forum moderator
                                                      Joined: Jan 13 06
                                                      Posts: 1498
                                                      Credit: 6,806,826
                                                      RAC: 5,065
                                                      Message 46246 - Posted 16 May 2013 22:21:20 UTC - in response to Message 46245.

                                                        I have a Fortran error running
                                                        hadcm3n_4db9_1980_40_008348264_2
                                                        on dual quad-core Xeon processors, Windows 7 x64, all patched etc. The other 7 Projects are running without problems. I aborted this one.


                                                        http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=8499125

                                                        Yes, it shows all the hallmarks of being a bad workunit. Aborting it is the right thing to do :-)




                                                        ____________
                                                        I'm a volunteer and my views are my own.
                                                        News and Announcements and FAQ

                                                        deadsenator
                                                        Joined: Aug 6 08
                                                        Posts: 3
                                                        Credit: 16,417,183
                                                        RAC: 15,398
                                                        Message 46643 - Posted 19 Jul 2013 3:44:22 UTC

                                                          After a spate of these a few months ago, I am getting this error again.

                                                          The workunits then show up as a computation error in Boinc. Unlike what I have read here, some of my errors come from failed workunits that are 600+ hours in. 97% complete and blammo.

                                                          Iain Inglis
                                                          Forum moderator
                                                          Joined: Jan 16 10
                                                          Posts: 408
                                                          Credit: 9,532
                                                          RAC: 0
                                                          Message 46644 - Posted 19 Jul 2013 8:31:59 UTC - in response to Message 46643.

                                                            After a spate of these a few months ago, I am getting this error again.

                                                            The workunits then show up as a computation error in BOINC. Unlike what I have read here, some of my errors come from failed workunits that are 600+ hours in. 97% complete and blammo.

                                                            The machines you have are very powerful ones indeed, but the HADCM3N model is also large. Attempting to run 20 of them on any machine is likely to result in a significant failure rate. This type of model is particularly sensitive at the decade upload point (i.e. 25%, 50% etc.). The FORTRAN error is usually a sign of competition for resources, which will be a precursor to failure for HADCM3N.

                                                            The Xeon E5645, for example, has hyperthreading. The model completion rate might improve by limiting the number of CPUs in BOINC to the number of cores, which won't greatly affect the throughput as hyperthreading only gives a 20% or so advantage.
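A minimal sketch of the arithmetic behind that suggestion, assuming it is applied through BOINC's computing preference that caps the percentage of processors used (the `max_ncpus_pct` element in a `global_prefs_override.xml` file). The helper name is hypothetical, and the 6-core/12-thread figures are the published specification of a single Xeon E5645, not something measured in this thread.

```python
def cpu_percent_for_physical_cores(physical_cores: int, logical_cpus: int) -> float:
    """Percentage of logical CPUs that equals one task per physical core.

    Hypothetical helper: the result is the value one would enter for the
    "use at most N% of the processors" preference.
    """
    if physical_cores <= 0 or logical_cpus < physical_cores:
        raise ValueError("expected 0 < physical_cores <= logical_cpus")
    return 100.0 * physical_cores / logical_cpus


# Assumed example: one Xeon E5645 exposes 6 cores as 12 logical CPUs.
percent = cpu_percent_for_physical_cores(physical_cores=6, logical_cpus=12)
print(f"Limit BOINC to {percent:.0f}% of the processors")
```

On any machine where hyperthreading doubles the logical CPU count, the same calculation gives 50%.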

                                                            MikeMarsUK
                                                            Forum moderator
                                                            Joined: Jan 13 06
                                                            Posts: 1498
                                                            Credit: 6,806,826
                                                            RAC: 5,065
                                                            Message 46647 - Posted 19 Jul 2013 13:30:59 UTC - in response to Message 45846.

                                                              ... On Windows it may be a different matter. It's possible they may sit there pretending to run but not clocking up any progress ...


                                                              This will be interesting ... I downloaded a bunch yesterday after the servers came back, and I am away from home for 9 days. Unfortunate timing.

                                                              ____________
                                                              I'm a volunteer and my views are my own.
                                                              News and Announcements and FAQ

                                                              deadsenator
                                                              Joined: Aug 6 08
                                                              Posts: 3
                                                              Credit: 16,417,183
                                                              RAC: 15,398
                                                              Message 46649 - Posted 19 Jul 2013 16:59:32 UTC - in response to Message 46644.


                                                                The machines you have are very powerful ones indeed, but the HADCM3N model is also large. Attempting to run 20 of them on any machine is likely to result in a significant failure rate. This type of model is particularly sensitive at the decade upload point (i.e. 25%, 50% etc.). The FORTRAN error is usually a sign of competition for resources, which will be a precursor to failure for HADCM3N.

                                                                The Xeon E5645, for example, has hyperthreading. The model completion rate might improve by limiting the number of CPUs in BOINC to the number of cores, which won't greatly affect the throughput as hyperthreading only gives a 20% or so advantage.


                                                                Thank you for your input, Iain. I have never before had any significant error rates and the system normally runs fine, except for the aforementioned spike in errors back in spring and the recent set that I've mentioned. This last error was on a small WU and died at 0%, but that seems to have been the exception for me. I did not do a thorough analysis, but most of my previous failed WUs had been month-long exercises that failed towards the end, perhaps because of the sensitive upload point you've mentioned.

                                                                I am somewhat confused by your statements above about model completion rate improving by limiting the cores (I presume you meant to real cores only), but then you state that HT gives a 20% advantage. I understand how HT works, but I am just asking for clarification about Climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job, somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

                                                                If the errors persist I will look to implement your advice, but I am initially reluctant to limit the cores and reduce my intended work unit production. I have made one change, and that is to keep the WU in memory when suspended. I feel foolish for not setting this before as some of those earlier errors seem to hit when re-activating the client.

                                                                Thank you again.

                                                                Les Bayliss
                                                                Forum moderator
                                                                Joined: Sep 5 04
                                                                Posts: 5129
                                                                Credit: 8,459,347
                                                                RAC: 5,837
                                                                Message 46650 - Posted 19 Jul 2013 21:42:07 UTC - in response to Message 46649.

                                                                  As a rough rule of thumb, it has in the past been considered that the hadcm3n models need 1 GB of RAM each.
                                                                  So, 20 models, 20 GB, plus some more for the OS.
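The rule of thumb above can be sketched as a quick calculation. The function names are hypothetical, and the 2 GB set aside for the operating system is an assumed figure for illustration only; the post says "some more" without giving a number.

```python
def required_ram_gb(models: int, gb_per_model: float = 1.0,
                    os_reserve_gb: float = 2.0) -> float:
    """RAM needed for `models` concurrent tasks at roughly 1 GB each.

    os_reserve_gb is an assumption, not a figure from the thread.
    """
    return models * gb_per_model + os_reserve_gb


def max_models_for_ram(total_ram_gb: float, gb_per_model: float = 1.0,
                       os_reserve_gb: float = 2.0) -> int:
    """Largest number of concurrent models that fits in total_ram_gb."""
    usable = total_ram_gb - os_reserve_gb
    if usable <= 0:
        return 0
    return int(usable // gb_per_model)


print(required_ram_gb(20))     # RAM to plan for when running 20 models
print(max_models_for_ram(16))  # models that fit on a 16 GB machine
```

Changing `os_reserve_gb` shifts both answers, so treat the output as a starting point rather than a hard limit.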

                                                                  Greg van Paassen
                                                                  Joined: Nov 17 07
                                                                  Posts: 131
                                                                  Credit: 3,745,919
                                                                  RAC: 4,674
                                                                  Message 46651 - Posted 19 Jul 2013 22:55:40 UTC - in response to Message 46649.

                                                                    I understand how HT works, but I am just asking for clarification about Climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job, somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

                                                                    Just to be clear, WUs are single-threaded.

                                                                    On my machine (Core i7 Sandy Bridge, 4 cores, 8 threads), with 4 models running concurrently, each takes about 1.0 seconds per time step (s/ts). With 8 running, each takes about 1.5 s/ts. Doing the arithmetic, doubling the number of WUs running concurrently increases total throughput by one third. It also increases the clock time required to complete any one WU by half.

                                                                    So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.
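The arithmetic can be checked with a few lines. This is only a sketch using the two timings quoted in the post (4 models at 1.0 s/ts, 8 models at 1.5 s/ts); the function and variable names are made up for illustration.

```python
def throughput(models: int, seconds_per_timestep: float) -> float:
    """Total time steps completed per second across all running models."""
    return models / seconds_per_timestep


four_models = throughput(4, 1.0)   # one model per physical core
eight_models = throughput(8, 1.5)  # one model per hyperthread

# Relative gain in whole-machine throughput from doubling the model count.
throughput_gain = eight_models / four_models - 1

# Relative increase in wall-clock time for any single model.
per_model_slowdown = 1.5 / 1.0 - 1

print(f"throughput gain: {throughput_gain:.1%}")
print(f"per-model slowdown: {per_model_slowdown:.1%}")
```

The two numbers move in opposite directions: the machine as a whole does about a third more work, while each individual model takes half as long again to finish.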

                                                                    HadCM3Ns seem to be sensitive to disk I/O congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.

                                                                    (For an idea of the numbers: on my machine, at 1.5 s/ts, each model averages about 0.85 MB/s continual disk writing, with spikes up to 7 MB/s during checkpoints (every 72 time steps). During the decadal zip-file uploads, disk activity goes as high as the disk system will support (over 65 MB/s reads and 35 MB/s writes at the same time) for a few seconds.)
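To put those figures in context, here is a rough sketch of what they imply for a machine running 8 models at once. The per-model numbers come from the post; the assumption that all 8 models checkpoint at the same instant is a deliberately pessimistic worst case, not something reported in the thread.

```python
# Figures quoted in the post (per model, with 8 models running).
SECONDS_PER_TIMESTEP = 1.5
CHECKPOINT_EVERY_TIMESTEPS = 72
AVG_WRITE_MB_S = 0.85
PEAK_CHECKPOINT_MB_S = 7.0

# Assumption for illustration: 8 concurrent models, as in the 1.5 s/ts case.
CONCURRENT_MODELS = 8

# Wall-clock seconds between checkpoints for one model.
checkpoint_interval_s = CHECKPOINT_EVERY_TIMESTEPS * SECONDS_PER_TIMESTEP

# Sustained write load across all models.
aggregate_avg_mb_s = CONCURRENT_MODELS * AVG_WRITE_MB_S

# Pessimistic case: every model checkpoints at the same moment.
worst_case_checkpoint_mb_s = CONCURRENT_MODELS * PEAK_CHECKPOINT_MB_S

print(f"checkpoint every {checkpoint_interval_s:.0f} s per model")
print(f"average writes: {aggregate_avg_mb_s:.1f} MB/s in total")
print(f"simultaneous checkpoints: up to {worst_case_checkpoint_mb_s:.0f} MB/s")
```

The sustained load is small; the bursts are what compete for the disk.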

                                                                    Iain Inglis
                                                                    Forum moderator
                                                                    Joined: Jan 16 10
                                                                    Posts: 408
                                                                    Credit: 9,532
                                                                    RAC: 0
                                                                    Message 46652 - Posted 19 Jul 2013 23:02:56 UTC - in response to Message 46649.

                                                                      I am somewhat confused by your statements above about model completion rate improving by limiting the cores (I presume you meant to real cores only), but then you state that HT gives a 20% advantage. I understand how HT works, but I am just asking for clarification about Climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job, somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

                                                                      A distinction needs to be made between the effect that HT has on machine throughput and the effect that it has on the time taken to complete each task. Assuming all the HT pseudo-cores are running tasks, then HT increases throughput (for which credits/RAC are a suitable metric) but decreases the rate at which each task progresses - i.e. each task takes longer to complete, almost double the time.

                                                                      The cause of the HADCM3N decadal sensitivity is thought to be a timing error: in other words, the sequence of actions the various parts of a HADCM3N task needs to perform gets messed up, so when the Zip file comes to be created the required files aren't there, so the model crashes. Slowing a task down might, in principle, work either way: it could reduce the probability of the timing/sequencing error or it could increase that probability, depending on what precisely the error is.

                                                                      So when I say "completion rate" I don't mean the rate of progress a model makes whether it completes or not, I mean the proportion of viable models that actually finish. If all models completed then the throughput would measure both the rate of progress and the rate of completion; in the presence of errors the two rates diverge.
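The difference between throughput and completion rate can be made concrete with a small sketch. It combines the definition above with the timings Greg quoted earlier in the thread (4 models at 1.0 s/ts against 8 at 1.5 s/ts), and it makes the simplifying assumption that a failed model contributes nothing. All names are hypothetical.

```python
def useful_throughput(raw_throughput: float, completion_fraction: float) -> float:
    """Work that ends up in finished models, assuming failures count for nothing."""
    return raw_throughput * completion_fraction


def break_even_fraction(base_throughput: float, ht_throughput: float,
                        base_fraction: float = 1.0) -> float:
    """Completion fraction the HT setup needs to match the non-HT setup."""
    return base_fraction * base_throughput / ht_throughput


base = 4 / 1.0  # time steps per second with one model per core
ht = 8 / 1.5    # time steps per second with one model per hyperthread

needed = break_even_fraction(base, ht)
print(f"HT pays off only if at least {needed:.0%} of models complete")
```

Under these assumptions, running on every hyperthread is a net gain only while the proportion of models that finish stays above the printed threshold, relative to a setup where every model finishes.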

                                                                      My own experience is that leaving HADCM3N models entirely undisturbed reduces the error rate to zero - i.e. I have had no failures at all since leaving them alone. (There will, of course, be "invalid theta" and other physics errors, and download errors on occasion too; there's nothing we volunteers can do about that.) My prejudice is therefore that the HT process represents "disturbance" and is to be discouraged: it is, however, merely a prejudice: almost all the machines to which I have access have been running work simulations solidly for six months, so I have simply not been running CPDN nor have I tested HT/multi-core/HADCM3N interactions since being told about the timing error (at the Guardian University Awards in February). Unfortunately, the real world does intrude from time to time.

                                                                      I know we say that incomplete models are useful to the project: it's a true but nonetheless rather lawyerly evasion - it's just got to be better for the project to get complete models.

                                                                      deadsenator
                                                                      Joined: Aug 6 08
                                                                      Posts: 3
                                                                      Credit: 16,417,183
                                                                      RAC: 15,398
                                                                      Message 46654 - Posted 20 Jul 2013 1:44:38 UTC - in response to Message 46652.

                                                                        As a rough rule of thumb, it has in the past been considered that the hadcm3n models need 1 GB of RAM each.
                                                                        So, 20 models, 20 GB, plus some more for the OS.


                                                                        Les, I did not know that. Apparently 12GB isn't enough, so you've given me a great reason to add more RAM. Thanks!


                                                                        Just to be clear, WUs are single-threaded...So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.


                                                                        Thank you, Greg. Yes, I know about WUs being single-threaded and what you've stated aligns with how I understand it. I consider a "job" not to be just one WU, but the entire model being crunched. So, yes, the time per WU increases, but since you are processing more WUs overall, the total job time will be reduced.

                                                                        HadCM3Ns seem to be sensitive to disk I/O congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.


                                                                        Well, I am using an SSD (Samsung 840), so that should help, but your point is a good one. The takeaway for me is that resource contention can occur at each level (CPU, RAM and disk) and the code is very sensitive to this.

                                                                        ... My own experience is that leaving HADCM3N models entirely undisturbed reduces the error rate to zero


                                                                        Iain, this echoes similar experiences I have had. After shutting down BOINC, ramping back up can be a tenuous experience and this is when I have experienced some problems. I have cut back on the number of interruptions and I have made the memory setting change I stated above in the attempt to quell any potential disturbances. Unfortunately, as a pesky human, I like to use this machine for other things too on occasion. I didn't build it *only* for BOINC.

                                                                        Your thoughts regarding HT are noted and certainly could come into play with the instability we've discussed. I'll take the opposite track and continue to use it as I have not experienced any consistent instability that I could tie to such a global environment setting. Additionally, it seems to be only this system that experiences these errors. The other two don't seem to crash WUs, but are using HT. Perhaps if the errors continue, I will test your solution.

                                                                        In addition to leaving the WU in memory, what I will do is look to increasing my RAM and see if this helps with resource contention.

                                                                        Thank you all for your help. Your input is highly valued.

                                                                        Eirik Redd
                                                                        Joined: Aug 31 04
                                                                        Posts: 193
                                                                        Credit: 23,214,844
                                                                        RAC: 31,385
                                                                        Message 46656 - Posted 20 Jul 2013 5:51:55 UTC

                                                                          I can't prove this (I'm working on stats), but preliminary indications here are:
                                                                          If you have more than four cores, leaving one of them free for the OS might help total throughput. Just a thought; I'm not sure, but it seems to work here.

                                                                          Norman Guinasso
                                                                          Joined: Jan 28 05
                                                                          Posts: 2
                                                                          Credit: 972,460
                                                                          RAC: 0
                                                                          Message 46687 - Posted 24 Jul 2013 1:47:51 UTC

                                                                            Getting a Visual Fortran run-time error on Windows 7.
                                                                            I have not changed anything.
                                                                            I cannot delete the error window.

                                                                            JIM
                                                                            Joined: Dec 31 07
                                                                            Posts: 609
                                                                            Credit: 3,342,044
                                                                            RAC: 4,746
                                                                            Message 46689 - Posted 24 Jul 2013 3:29:58 UTC - in response to Message 46687.

                                                                              Getting a Visual Fortran run-time error on Windows 7.
                                                                              I have not changed anything.
                                                                              I cannot delete the error window.


                                                                              Have you tried exiting the model or models, closing down BOINC and rebooting? The WU that is throwing errors will most likely crash, but there is nothing that can be done about that. At least it will allow you to delete the error message window. The WU is most likely non-viable anyway.


                                                                              Iain Inglis
                                                                              Forum moderator
                                                                              Joined: Jan 16 10
                                                                              Posts: 408
                                                                              Credit: 9,532
                                                                              RAC: 0
                                                                              Message 46693 - Posted 24 Jul 2013 12:01:27 UTC - in response to Message 46687.

                                                                                Getting a Visual Fortran run-time error on Windows 7.
                                                                                I have not changed anything.
                                                                                I cannot delete the error window.

                                                                                The only time a model running on a machine of mine produced a sequence of these FORTRAN errors, there was an unrelated process running 100% in the background (a berserk printer driver). Killing that other process first saved the CPDN model, though that was pre-HADCM3N. HADCM3N models do not seem very robust, so JIM is probably right: the model may now fail whatever you do ...

                                                                                astroWX
                                                                                Forum moderator
                                                                                Joined: Aug 5 04
                                                                                Posts: 1250
                                                                                Credit: 34,995,599
                                                                                RAC: 23,022
                                                                                Message 46697 - Posted 24 Jul 2013 16:55:31 UTC - in response to Message 46687.

                                                                                  Getting a Visual Fortran run-time error on Windows 7.
                                                                                  I have not changed anything.
                                                                                  I cannot delete the error window.

                                                                                  A friend had two tasks throw Fortran errors at about the same time. She held them for me to see. We tried to salvage the tasks, in case they were the 'soft' type (irritations, but not fatal), to no avail. Both failed.

                                                                                  Six Fortran error popups are thrown by each failed task of that type. It seems, at the time, that we can't get rid of the things (doubly so for twelve popups with two simultaneous failures).

                                                                                  It is the luck of the draw whether we inherit reruns of old, flawed tasks.

                                                                                  ____________
                                                                                  "We have met the enemy and he is us." -- Pogo
                                                                                  Greetings from coastal Washington state, the scenic US Pacific Northwest.

                                                                                  Norman Guinasso
                                                                                  Joined: Jan 28 05
                                                                                  Posts: 2
                                                                                  Credit: 972,460
                                                                                  RAC: 0
                                                                                  Message 46700 - Posted 24 Jul 2013 22:50:23 UTC

                                                                                    Your program is crashing my computer. It displays a greyed out window and locks the computer. I have to power off the computer and reboot to get out of this. I have suspended running climate models until you tell me how to fix this. I have been running these programs for years without problems.
                                                                                    Norman Guinasso


                                                                                    astroWX
                                                                                    Forum moderator
                                                                                    Joined: Aug 5 04
                                                                                    Posts: 1250
                                                                                    Credit: 34,995,599
                                                                                    RAC: 23,022
                                                                                    Message 46704 - Posted 25 Jul 2013 0:06:38 UTC

                                                                                      Crashing? Are you sure it isn't a case of failure to respond? (BOINC loses track of its other half.) I see it often (too often!) and it usually reconnects if ignored. How long do you wait before deciding your machine is 'crashed'?

                                                                                      Thread: This isn't a Visual Fortran issue...

                                                                                      ____________
                                                                                      "We have met the enemy and he is us." -- Pogo
                                                                                      Greetings from coastal Washington state, the scenic US Pacific Northwest.

                                                                                      Eirik Redd
                                                                                      Joined: Aug 31 04
                                                                                      Posts: 193
                                                                                      Credit: 23,214,844
                                                                                      RAC: 31,385
                                                                                      Message 46716 - Posted 27 Jul 2013 12:50:41 UTC - in response to Message 46700.

                                                                                        Of the two recent failed tasks, one was a malformed workunit that got a C++ alloc error when it tried to allocate more memory than existed. There have been a few of these malformed models; most of them have been tried and failed several times on various computers and are unlikely (we hope) to be re-issued.

                                                                                        As for the other one that failed on the same machine, I have never seen the like.

                                                                                        BUT both failures seem to be rare. Sorry you got hit with two of them; most workunits lately have been working well.

                                                                                        Please try another; I see you have some already downloaded. You are unlikely to get another bad one, and if you do, keep complaining.

                                                                                        fred
                                                                                        Joined: Sep 29 13
                                                                                        Posts: 2
                                                                                        Credit: 51,429
                                                                                        RAC: 0
                                                                                        Message 47938 - Posted 7 Jan 2014 4:14:54 UTC

                                                                                          I'm running BOINC Manager Version 7.2.33 (x64), wxWidgets Version 2.8.10 on Windows Vista 64 bit version, and am getting a Visual Fortran run-time error which seems to be associated with climateprediction.net workunit hadcm3n_3ih_1980_40_008348465 starting. The other project being undertaken is malariacontrol.net, but only climate prediction is working when the problem presents. The error window will only go away when I close BOINC Manager.

                                                                                          7/01/2014 1:42:23 AM | | cc_config.xml not found - using defaults
                                                                                          7/01/2014 1:42:23 AM | | Starting BOINC client version 7.2.33 for windows_x86_64
                                                                                          7/01/2014 1:42:23 AM | | log flags: file_xfer, sched_ops, task
                                                                                          7/01/2014 1:42:23 AM | | Libraries: libcurl/7.25.0 OpenSSL/1.0.1 zlib/1.2.6
                                                                                          7/01/2014 1:42:23 AM | | Data directory: D:\ProgramData\BOINC
                                                                                          7/01/2014 1:42:23 AM | | Running under account fred
                                                                                          7/01/2014 1:42:23 AM | | CAL: ATI GPU 0: AMD Radeon HD 6790/6850/6870 series (Barts) (CAL version 1.4.1741, 1024MB, 991MB available, 3149 GFLOPS peak)
                                                                                          7/01/2014 1:42:23 AM | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 6790/6850/6870 series (Barts) (driver version 1084.4 (VM), device version OpenCL 1.2 AMD-APP (1084.4), 1024MB, 991MB available, 3149 GFLOPS peak)
                                                                                          7/01/2014 1:42:23 AM | | OpenCL CPU: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 1084.4 (sse2), device version OpenCL 1.2 AMD-APP (1084.4))
                                                                                          7/01/2014 1:42:23 AM | | Host name: Obelisk
                                                                                          7/01/2014 1:42:23 AM | | Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz [Family 6 Model 23 Stepping 6]
                                                                                          7/01/2014 1:42:23 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
                                                                                          7/01/2014 1:42:23 AM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
                                                                                          7/01/2014 1:42:23 AM | | Memory: 4.00 GB physical, 11.97 GB virtual
                                                                                          7/01/2014 1:42:23 AM | | Disk: 298.09 GB total, 175.21 GB free
                                                                                          7/01/2014 1:42:23 AM | | Local time is UTC +10 hours
                                                                                          7/01/2014 1:42:23 AM | climateprediction.net | URL http://climateprediction.net/; Computer ID 1295373; resource share 100
                                                                                          7/01/2014 1:42:23 AM | malariacontrol.net | URL http://www.malariacontrol.net/; Computer ID 380102; resource share 100
                                                                                          7/01/2014 1:42:23 AM | malariacontrol.net | General prefs: from malariacontrol.net (last modified 30-Oct-2011 18:56:32)
                                                                                          7/01/2014 1:42:23 AM | malariacontrol.net | Host location: none
                                                                                          7/01/2014 1:42:23 AM | malariacontrol.net | General prefs: using your defaults
                                                                                          7/01/2014 1:42:23 AM | | Reading preferences override file
                                                                                          7/01/2014 1:42:23 AM | | Preferences:
                                                                                          7/01/2014 1:42:23 AM | | max memory usage when active: 4093.58MB
                                                                                          7/01/2014 1:42:23 AM | | max memory usage when idle: 3684.22MB
                                                                                          7/01/2014 1:42:23 AM | | max disk usage: 10.00GB
                                                                                          7/01/2014 1:42:23 AM | | don't compute while active
                                                                                          7/01/2014 1:42:23 AM | | don't use GPU while active
                                                                                          7/01/2014 1:42:23 AM | | suspend work if non-BOINC CPU load exceeds 20%
                                                                                          7/01/2014 1:42:23 AM | | (to change preferences, visit a project web site or select Preferences in the Manager)
                                                                                          7/01/2014 1:42:23 AM | | Not using a proxy
                                                                                          7/01/2014 1:42:24 AM | | Suspending computation - computer is in use
                                                                                          7/01/2014 1:42:24 AM | | Suspending network activity - computer is in use
                                                                                          7/01/2014 1:42:54 AM | | Resuming network activity
                                                                                          7/01/2014 1:42:54 AM | climateprediction.net | Restarting task hadcm3n_obum_1900_40_008470481_1 using hadcm3n version 607 in slot 1
                                                                                          7/01/2014 1:42:54 AM | climateprediction.net | Restarting task hadcm3n_3ilh_1980_40_008348465_4 using hadcm3n version 607 in slot 2
                                                                                          7/01/2014 1:42:54 AM | malariacontrol.net | Sending scheduler request: To fetch work.
                                                                                          7/01/2014 1:42:54 AM | malariacontrol.net | Requesting new tasks for ATI
                                                                                          7/01/2014 1:42:56 AM | | Suspending computation - computer is in use
                                                                                          7/01/2014 1:42:56 AM | | Suspending network activity - computer is in use
                                                                                          7/01/2014 1:46:48 AM | | Resuming computation
                                                                                          7/01/2014 1:46:48 AM | | Resuming network activity
                                                                                          7/01/2014 1:46:48 AM | malariacontrol.net | Scheduler request completed: got 0 new tasks
                                                                                          7/01/2014 1:46:48 AM | malariacontrol.net | No work sent
                                                                                          7/01/2014 1:46:51 AM | | Suspending computation - computer is in use
                                                                                          7/01/2014 1:46:51 AM | | Suspending network activity - computer is in use
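
For anyone who wants to pull the affected task names out of an event log like the one above, here is a minimal sketch. It assumes the `timestamp | project | message` layout shown in this post and the exact "Restarting task ... in slot N" wording; the function name `restarted_tasks` and both regular expressions are illustrative, not part of BOINC.

```python
import re

# Assumed layout (taken from the log above): "<date> <time> <AM|PM> | <project> | <message>"
LINE_RE = re.compile(
    r"^\s*(?P<stamp>\S+ \S+ [AP]M) \| (?P<project>[^|]*)\| (?P<message>.*)$"
)
# Assumed message wording for a task restart, as it appears in the log above.
TASK_RE = re.compile(
    r"^Restarting task (?P<task>\S+) using (?P<app>\S+) "
    r"version (?P<version>\d+) in slot (?P<slot>\d+)$"
)


def restarted_tasks(log_text):
    """Return (project, task name, slot) for every 'Restarting task' line."""
    tasks = []
    for line in log_text.splitlines():
        line_match = LINE_RE.match(line)
        if not line_match:
            continue  # skip blank lines or lines in another format
        task_match = TASK_RE.match(line_match.group("message").strip())
        if task_match:
            tasks.append(
                (
                    line_match.group("project").strip(),
                    task_match.group("task"),
                    int(task_match.group("slot")),
                )
            )
    return tasks


# Three lines copied from the log above.
sample = """\
7/01/2014 1:42:54 AM | | Resuming network activity
7/01/2014 1:42:54 AM | climateprediction.net | Restarting task hadcm3n_obum_1900_40_008470481_1 using hadcm3n version 607 in slot 1
7/01/2014 1:42:54 AM | climateprediction.net | Restarting task hadcm3n_3ilh_1980_40_008348465_4 using hadcm3n version 607 in slot 2
"""

for project, task, slot in restarted_tasks(sample):
    print(project, task, slot)
```

The task names it lists can then be looked up on the project web site to see whether other hosts failed on the same work unit.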

                                                                                          http://s240.photobucket.com/user/expie_photos/media/run%20time%20errors/VisualFortranRun-TimeError_zps212fac84.jpg.html

                                                                                          Profile Dave Jackson
                                                                                          Joined: May 15 09
                                                                                          Posts: 605
                                                                                          Credit: 581,731
                                                                                          RAC: 157
                                                                                          Message 47939 - Posted 7 Jan 2014 10:25:32 UTC

                                                                                            The task has already failed for three other users, and I think yours will be its last failure before it is consigned to the dustbin. I note it still shows as running, so you may need to abort it manually. All the other tasks in the work unit appear to have failed as well.

                                                                                            fred
                                                                                            Joined: Sep 29 13
                                                                                            Posts: 2
                                                                                            Credit: 51,429
                                                                                            RAC: 0
                                                                                            Message 47940 - Posted 7 Jan 2014 12:48:53 UTC - in response to Message 47939.

                                                                                              OK, done. Thanks.





                                                                                              Copyright © 2002-2014 climateprediction.net