Error report hadam3p_eu

Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 48424 - Posted: 18 Mar 2014, 0:31:01 UTC
Last modified: 18 Mar 2014, 0:32:37 UTC

I've just had an error on a hadam3p_eu WU. It stopped running after just over 30 seconds.

The event log says:
17/03/2014 23:46:41 Starting task hadam3p_eu_c15a_1997_1_008565559_1
17/03/2014 23:47:24 Computation for task hadam3p_eu_c15a_1997_1_008565559_1 finished
Output file hadam3p_eu_c15a_1997_1_008565559_1_1.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_2.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_3.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_4.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_5.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_6.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_7.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_8.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_9.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_10.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_11.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_12.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent
Output file hadam3p_eu_c15a_1997_1_008565559_1_13.zip for task hadam3p_eu_c15a_1997_1_008565559_1 absent

This looks like a problem with this WU, but:
OpenCL: Intel GPU 0: Intel(R) HD Graphics 4000 (driver version 8.15.10.2696, device version OpenCL 1.1, 1624MB, 1624MB available, 45 GFLOPS peak)
OpenCL CPU: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 1.1, device version OpenCL 1.1 (Build 30316.30328))
Processor: 4 GenuineIntel Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz [Family 6 Model 58 Stepping 9]
Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 sse4_2 popcnt aes syscall nx lm vmx smx tm2 pbe
OS: Microsoft Windows 7: Professional x64 Edition, Service Pack 1, (06.01.7601.00)
Memory: 3.70 GB physical, 7.40 GB virtual
Disk: 282.95 GB total, 166.79 GB free


Assume abort?

HTH
ID: 48424
Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 48425 - Posted: 18 Mar 2014, 1:28:47 UTC

Okay. I have now had the same problem with WUs
hadam3p_eu_c10i_1997_1_008565387_0
hadam3p_eu_c10m_1997_1_008565391_0
hadam3p_eu_c10q_1997_1_008565395_0


The event log also says:
"Task hadam3p_eu_c10q_1997_1_008565395_0 exited with zero status but no 'finished' file.
If this happens repeatedly you may need to reset the project."

Fine. This has now happened repeatedly.

That said, I am 90-odd hours in and 44 hours from completing another hadam3p_eu WU. Will resetting the project reset this WU? Am I advised to wait until this WU ends before resetting the project?

ID: 48425
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2168
Credit: 64,535,199
RAC: 6,573
Message 48426 - Posted: 18 Mar 2014, 1:59:03 UTC - in response to Message 48425.  

If you reset the project you'll lose any tasks that are currently running, and you just completed three successfully this evening (U.S. time).

That said, the 1997 work units look to have something bad with them and they crash quickly. They are failing on everyone's PCs. You should be fine to continue running the remaining tasks, although you may pick up some more 1997 tasks after they fail on someone else's PC.
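For anyone wanting to check their own task list for the affected batch, here is a minimal Python sketch. It assumes the four-digit field in a task name (the 1997 in hadam3p_eu_c15a_1997_1_008565559_1) is the model start year; that reading is a guess from the names quoted in this thread. The helpers task_year and tasks_for_year are made-up names for illustration, not part of BOINC or CPDN.

```python
import re

# Assumption: the first four-digit field delimited by underscores is the model year.
YEAR_RE = re.compile(r"_(\d{4})_")

def task_year(task_name):
    """Return the four-digit year field from a task name, or None if absent."""
    match = YEAR_RE.search(task_name)
    return int(match.group(1)) if match else None

def tasks_for_year(task_names, year):
    """Return the task names whose year field equals the given year."""
    return [name for name in task_names if task_year(name) == year]

if __name__ == "__main__":
    # Task names copied from the posts above.
    names = [
        "hadam3p_eu_c15a_1997_1_008565559_1",
        "hadam3p_eu_c10i_1997_1_008565387_0",
    ]
    for name in tasks_for_year(names, 1997):
        print(name)
```

Paste your own task names into the list to see which ones belong to the 1997 batch.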

ID: 48426
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48427 - Posted: 18 Mar 2014, 2:03:36 UTC
Last modified: 18 Mar 2014, 2:04:40 UTC

Hello Niall

For your 1st post:

All of the "file .... absent" messages are just a note from BOINC to say that it was told to expect them, but it couldn't find them when the task finished. Fairly obvious to humans, as the model never got to the point where the files would have been created.
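To see at a glance which tasks produced these notes, a rough Python sketch can tally them from a saved copy of the event log. It assumes the log lines have exactly the "Output file ... for task ... absent" wording quoted in the first post; count_absent_outputs is a made-up helper name, not a BOINC function.

```python
import re
from collections import Counter

# Matches lines such as:
# Output file <name>.zip for task <task> absent
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")

def count_absent_outputs(log_text):
    """Return a Counter mapping task name to the number of absent output files."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ABSENT_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts

if __name__ == "__main__":
    # Two lines copied from the event log in the first post.
    sample = (
        "Output file hadam3p_eu_c15a_1997_1_008565559_1_1.zip "
        "for task hadam3p_eu_c15a_1997_1_008565559_1 absent\n"
        "Output file hadam3p_eu_c15a_1997_1_008565559_1_2.zip "
        "for task hadam3p_eu_c15a_1997_1_008565559_1 absent\n"
    )
    for task, n in count_absent_outputs(sample).items():
        print(task, n)
```

A task with every expected output file absent most likely crashed before producing any output, which matches what happened here.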

The model failed because of "REPLANCA: Current time precedes start time of data", as written in the Stderr list on the model's web page.
This is being discussed, although not much yet, in this thread.

As it's already failed and reported back to the server, there's nothing left to Abort, although you may be left with some debris from the crash, which will have to be deleted manually.

***************

The Reset function in BOINC is intended to be used when repeated problems are encountered on a project.
The function deletes EVERYTHING associated with tasks from that project: programs, data files, models/tasks.
And then BOINC can start again with a clean slate and download everything again, including new tasks, if there are any. Which there aren't at present.

As for the message "exited with zero status" etc., this happens when the so-called "heartbeat" is lost for a few seconds, i.e. the communication between the client and the manager parts of BOINC.

And my guess is that it's caused by this other message: Suspended CPDN Monitor - Suspend request from BOINC..., which occurs LOTS of times through your models.

And THAT message is caused by using a setting of less than 100% for "Suspend work if CPU usage is above" (0 means no restriction), which is in the computing preferences section of your account page.

The default setting of 25%, or any value other than no restriction, might be OK for other projects, but not here. These huge climate programs don't like being interrupted, and sooner or later ...

***************

Finally, you appear to be using BOINC version 7.2.39.
According to posts on the BOINC/dev web site, that version is known to have a bad bug, and people should upgrade to the latest version with all due haste.

edit
I've been a bit verbose again. :)
ID: 48427
Niall

Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 48428 - Posted: 18 Mar 2014, 2:42:55 UTC - in response to Message 48427.  

Thanks people.

OK, if I am reading you correctly, I need to
a) Go into Computing Preferences and change the "Suspend work if CPU usage is above" field value from 25% (default) to 0% (no restriction) (done) and
b) Upgrade BOINC to the latest available edition (done).

The rest is a bug in the 1997 WUs.

Yes?
ID: 48428
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 48429 - Posted: 18 Mar 2014, 2:49:12 UTC - in response to Message 48428.  

Yes, that's correct.
Hopefully those changes will fix things until the next "interesting occurrence". :)

ID: 48429
