Ocean model crashed.

Author	Message
Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42760 - Posted: 7 Aug 2011, 4:33:29 UTC This model crashed, reporting a client error. http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13132040 I suspect this has been covered before, but I don't remember seeing it for this particular batch of models. My other task from the same batch is still running. One of the two other tasks from the work unit crashed at 0 time. This one was around the 50% mark. Despite the "client error" message, I suspect my computer is not at fault. Dave ID: 42760 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42783 - Posted: 17 Aug 2011, 20:27:05 UTC - in response to Message 42760. And the other full resolution ocean model I was running has now crashed, albeit another 20-odd percent further through. I notice none of the other ones in the same work units completed either, but I did hope the one that kept going beyond 70% would complete. The result page for it is http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13122714 Dave ID: 42783 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275	Message 42784 - Posted: 17 Aug 2011, 22:48:19 UTC Dave, I don't know if this is right, but the integer benchmark score is HUGE for that processor. Is it significantly overclocked? ID: 42784 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42785 - Posted: 18 Aug 2011, 4:30:00 UTC - in response to Message 42784. Thanks for that comment, which intrigues me. No, it is not significantly overclocked. The multiplier is standard; I haven't tried, but I don't think it is unlocked.
[dave@localhost ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz
stepping : 10
cpu MHz : 2699.621
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5399.24
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz
stepping : 10
cpu MHz : 2699.621
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5399.60
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
I can't think why the integer benchmark score should be HUGE when I haven't been playing around with overclocking.
Dave ID: 42785 · Reply Quote
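
The overclocking check discussed above can be sketched in Python: compare the nominal frequency embedded in the "model name" field with the measured "cpu MHz" field of /proc/cpuinfo. This is only an illustrative sketch. The function name clock_report and the SAMPLE text (trimmed from the output quoted above) are assumptions for the example, and a ratio close to 1.0 only suggests that the core is running near its rated speed at that moment.

```python
import re


def clock_report(cpuinfo_text):
    """Return a list of (nominal_mhz, measured_mhz, ratio), one per core.

    The nominal value is parsed from the '@ x.xxGHz' suffix of 'model name';
    the measured value comes from the 'cpu MHz' field that follows it.
    """
    reports = []
    nominal = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "model name":
            match = re.search(r"@\s*([\d.]+)\s*GHz", value)
            nominal = float(match.group(1)) * 1000.0 if match else None
        elif key == "cpu MHz" and nominal is not None:
            measured = float(value)
            reports.append((nominal, measured, measured / nominal))
    return reports


# Trimmed copy of the /proc/cpuinfo output quoted in the post above.
SAMPLE = """\
processor : 0
model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz
cpu MHz : 2699.621
processor : 1
model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz
cpu MHz : 2699.621
"""

for nominal, measured, ratio in clock_report(SAMPLE):
    print(f"nominal {nominal:.0f} MHz, measured {measured:.3f} MHz, ratio {ratio:.4f}")
```

On a live Linux system the same function could be fed open("/proc/cpuinfo").read() instead of SAMPLE. A ratio well above 1.0 would point to overclocking; a ratio near 1.0, as here, does not explain an unusually high benchmark score.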

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42786 - Posted: 18 Aug 2011, 10:30:40 UTC Get a copy of cpu-z and run that. It'll tell you the current speed of the processor. Backups: Here ID: 42786 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42787 - Posted: 18 Aug 2011, 15:13:51 UTC - in response to Message 42786. Thanks Les. cpu-z doesn't seem to have a Linux version, so I have installed PerlMon, which claims to report the actual CPU frequency. It tells me 2699.814MHz. Since I have never increased the clock frequency on this box by more than 10Hz, that makes me even more curious as to why the integer benchmark should be "HUGE" for the processor I have. I am disinclined to believe that anyone who shares the house with me is fiddling, as neither has been converted to the power of the penguin. Dave ID: 42787 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2184 Credit: 64,822,615 RAC: 5,275	Message 42789 - Posted: 19 Aug 2011, 19:21:18 UTC Perhaps it's an overestimation from a new version of BOINC. I don't have 6.12.xx running on any of my Linux boxes so am not sure if it's inflated by a new version? ID: 42789 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42791 - Posted: 20 Aug 2011, 9:13:47 UTC - in response to Message 42789. Could be. I never paid attention to the benchmark scores before upgrading to the latest BOINC, so I couldn't comment. It still leaves the question of why the crashes happened. In one of the two tasks in question, none of the other tasks in the work unit completed. In the other, one completed and the other crashed. If there are likely to be any more of these models, should I remove them from my preferences or not? I normally back up about once a week but haven't yet tried restoring a crashed model. What happens if restoring is successful? Are the results accepted after the task is already showing "Error while computing" in the status column? Sorry if many of these questions are ones that have been answered a number of times already. Dave ID: 42791 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42792 - Posted: 20 Aug 2011, 14:09:58 UTC - in response to Message 42791. Yes, the results are accepted and the data is used by the scientists just like any other completed WU. The one thing that can be off-putting is that when a restored WU finishes and sends the results, a line will appear in messages that says it was previously reported as an error. Just ignore this message, as it only applies to other projects and has no meaning in CPDN. ID: 42792 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42793 - Posted: 20 Aug 2011, 18:27:06 UTC - in response to Message 42792. Thanks Jim, I will try to back up twice a week and, in the event of another crash, will try a restore to see what happens. Dave ID: 42793 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42798 - Posted: 23 Aug 2011, 13:42:14 UTC As you can see, task hadcm3n_yfi4_1900_40_007352950_0 crashed at the very end with the 4.zip file missing.
8/23/2011 4:29:03 AM climateprediction.net Generated new computer cross-project ID: ead5a08edbfa16e91d2b66991616c4ee
8/23/2011 4:29:04 AM climateprediction.net Computation for task hadcm3n_yfi4_1900_40_007352950_0 finished
8/23/2011 4:29:04 AM climateprediction.net Output file hadcm3n_yfi4_1900_40_007352950_0_4.zip for task hadcm3n_yfi4_1900_40_007352950_0 absent
8/23/2011 4:29:05 AM climateprediction.net Restarting task hadam3p_pnw_3204_1980_1_007395222_0 using hadam3p_pnw version 609
Could this have anything to do with the recent change in the server used? I do have a backup that I made only 6 hours before the model finished, so if a solution can be found I could restore and run it to the end again. ID: 42798 · Reply Quote

JIM Send message Joined: 31 Dec 07 Posts: 1152 Credit: 22,363,583 RAC: 5,022	Message 42809 - Posted: 24 Aug 2011, 19:34:37 UTC - in response to Message 42798. I restored task hadcm3n_yfi4_1900_40_007352950_0 from the 6 hour backup and ran it again. Same result: it crashes approx. 2 hours from the end. The stderr output is shown below. The line that looks significant to me is: 23:50:55 (4200): Can't acquire lock file (32) - waiting 35s It shows up twice. Unless someone can come up with a correctable reason for the failure, I will delete the backup and go on. I hate to just write off 900 hours of crunching. <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> - exit code 193 (0xc1) </message> <stderr_txt> CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... 
CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... 02:11:08 (7860): Can't acquire lockfile (32) - waiting 35s 02:11:15 (3840): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... 23:50:55 (4200): Can't acquire lockfile (32) - waiting 35s 23:51:10 (7860): No heartbeat from core client for 30 sec - exiting CPDN Monitor - No 'heartbeat' from BOINC... CPDN Monitor - Quit request from BOINC... CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Suspended CPDN Monitor - Quit request from BOINC... Suspended CPDN Monitor - Suspend request from BOINC... Signal 11 received, exiting... Called boinc_finish </stderr_txt> ] ID: 42809 · Reply Quote
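
To make a long stderr dump like the one above easier to read, the repeated messages can be tallied. The following Python sketch is illustrative only: the search strings are copied from the stderr text quoted in this post, while the labels, the tally function, and the shortened SAMPLE string are assumptions made for the example.

```python
from collections import Counter

# Substrings to count; each one is taken from the stderr text quoted above.
PATTERNS = {
    "quit": "Quit request from BOINC",
    "suspend": "Suspend request from BOINC",
    "lockfile": "Can't acquire lockfile",
    "no_heartbeat": "No heartbeat from core client",
    "signal_11": "Signal 11 received",
}


def tally(stderr_text):
    """Count how often each known message appears in the stderr text."""
    counts = Counter()
    for label, needle in PATTERNS.items():
        counts[label] = stderr_text.count(needle)
    return counts


# Shortened excerpt of the stderr output, for demonstration.
SAMPLE = (
    "CPDN Monitor - Quit request from BOINC... "
    "CPDN Monitor - Quit request from BOINC... "
    "Suspended CPDN Monitor - Suspend request from BOINC... "
    "02:11:08 (7860): Can't acquire lockfile (32) - waiting 35s "
    "02:11:15 (3840): No heartbeat from core client for 30 sec - exiting "
    "Signal 11 received, exiting..."
)

for label, count in tally(SAMPLE).items():
    print(f"{label}: {count}")
```

Pasting the full stderr text into SAMPLE gives a quick picture of how often the task was suspended or asked to quit before the lockfile and heartbeat messages appeared. The counts alone do not identify the cause of the crash.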

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 42810 - Posted: 24 Aug 2011, 22:32:45 UTC - in response to Message 42809. The lock file problem could be several things, but is most likely caused by the antivirus program scanning each file that becomes active to check it for viruses before allowing you to use it. Some AV programs are more aggressive than others in the way they work. That is why it has long been recommended to block AV scanning of the entire BOINC data section, both manually started scans AND scheduled scans. Backups: Here ID: 42810 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42811 - Posted: 25 Aug 2011, 4:17:36 UTC - in response to Message 42810. Still no closer to working out why my units crashed. I assume there may be a clue in the errors below, but I am afraid they mean nothing to me. I do know that, in my case, it isn't an antivirus program scanning anything.
SIGABRT: abort called
Stack trace (17 frames):
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x80b80df]
[0xffffe400]
[0xffffe430]
/lib/libc.so.6(gsignal+0x51)[0xf7540ce1]
/lib/libc.so.6(abort+0x182)[0xf7542632]
/lib/libc.so.6(+0x65e4d)[0xf757ce4d]
/lib/libc.so.6(+0x6bba1)[0xf7582ba1]
/usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0xf7763321]
/usr/lib/libstdc++.so.6(_ZdaPv+0x1d)[0xf776337d]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8053e8e]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8057bc4]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x804f232]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8050491]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805112c]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805137a]
/lib/libc.so.6(__libc_start_main+0xe6)[0xf752db96]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(__gxx_personality_v0+0x169)[0x804cb51]
Dave ID: 42811 · Reply Quote

Greg van Paassen Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 42812 - Posted: 25 Aug 2011, 5:02:35 UTC - in response to Message 42811. Last modified: 25 Aug 2011, 5:04:11 UTC Dave, I see there's a large number of "Suspended CPDN monitor" messages -- your computer is throttling back the tasks quite frequently. In the past, people have noticed stability problems when CPDN tasks are starved of resources. Have you tried this:-
In Boinc Manager - Advanced menu - Preferences:
* On the "processor usage" tab, set 'Use at most ... % of CPU' to 100.00, or at a minimum, 80.00. (If you're worried about heat, it'd be best to set "On multiprocessor systems, use at most 50.00 % of the processors", i.e. process one task at a time.)
* On the "disk and memory usage" tab, ensure "Leave applications in memory when suspended" is selected.
* Also on this tab, for two HadCM3Ns in 2GB, it'd be best to set the Memory Usage figure "Use at most ... % when computer is in use" to at least 80.00 % to be on the safe side. Likewise for "Use at most ... when computer is idle".
If you've done all these and still have the problem, then it might be worth:-
* having a good vacuum-out of the CPU's heat sink, and unseating and re-seating the RAM modules
* running mprime or memtest86+ for 48 hours to check your computer's RAM
* upgrading its power supply to a newer, name brand model.
That last item helped me with a stability problem. Newer PSUs seem to reject power supply noise better than ones from a few years ago. ID: 42812 · Reply Quote
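
For anyone who prefers a file to the BOINC Manager dialogs, the preference values suggested above can also be expressed as a local override file. The Python sketch below builds such a file's contents. It rests on assumptions: the element names are those I understand BOINC's global_prefs_override.xml to use and should be checked against the documentation for your client version, and SETTINGS and build_override are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Values follow the suggestions in the post above. The tag names are an
# assumption based on BOINC's global_prefs_override.xml format; verify them
# against the documentation for your client version before relying on them.
SETTINGS = {
    "cpu_usage_limit": "100.0",        # 'Use at most ... % of CPU'
    "max_ncpus_pct": "100.0",          # 50.0 would run one task on a dual core
    "leave_apps_in_memory": "1",       # keep suspended tasks in memory
    "ram_max_used_busy_pct": "80.0",   # memory limit when computer is in use
    "ram_max_used_idle_pct": "80.0",   # memory limit when computer is idle
}


def build_override(settings):
    """Return the override preferences as an XML string."""
    root = ET.Element("global_preferences")
    for tag, value in settings.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_override(SETTINGS)
print(xml_text)
```

The printed text would be saved as global_prefs_override.xml in the BOINC data directory, after which the client needs to re-read its preferences or be restarted. Setting the same values through the Manager, as described above, should have the same effect.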

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42813 - Posted: 25 Aug 2011, 15:54:28 UTC - in response to Message 42812. Thanks Greg. The only change I made to the settings was to increase the % memory usage when the computer is in use. There shouldn't be much dust in the system, as it is fairly new. I will probably increase memory to 4GB soon, which may make a difference. With the regional models I don't seem to have any problems. Dave ID: 42813 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42814 - Posted: 25 Aug 2011, 17:37:03 UTC - in response to Message 42813. Just done a temperature check: with two regional models running it is 45C, dropping to 41C if I disable one of the tasks. Voltages seem to be stable too, with less than 0.1V change in either the 12V or 5V line when I stop a task. Dave ID: 42814 · Reply Quote

Greg van Paassen Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 42815 - Posted: 25 Aug 2011, 21:36:42 UTC - in response to Message 42814. OK, Dave, that's good. Increasing the memory % might have done the trick. If not, the only thing left is to run mprime and memtest86+ for 24 - 48 hours each. You can't use the PC for anything else while memtest86+ is running, though. ID: 42815 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4532 Credit: 18,763,629 RAC: 18,764	Message 42832 - Posted: 29 Aug 2011, 16:06:36 UTC - in response to Message 42815. Memory now up to 4GB. It is still going to be a long wait to see if the HADCM3 finishes or not. ID: 42832 · Reply Quote

Greg van Paassen Send message Joined: 17 Nov 07 Posts: 142 Credit: 4,271,370 RAC: 0	Message 42833 - Posted: 29 Aug 2011, 23:25:37 UTC - in response to Message 42832. Yes, I estimate about 4 weeks with the PC running 24/7, less what has been done so far. Anyway, good luck! ID: 42833 · Reply Quote