PC Freeze stops undamaged WU??

m.mitch
Joined: 10 Jan 06
Posts: 55
Credit: 1,530,621
RAC: 6,154
Message 21072 - Posted: 5 Mar 2006, 14:29:59 UTC
Last modified: 5 Mar 2006, 15:00:35 UTC

I've had a look through the fora, but can't find anything quite like this one.
The BOINC Manager downloaded another WU after a freeze and hard reboot, but I can't see any differences between the two sets of XML files other than the names.
No error had been reported to the server, so I was expecting an easy restart.
Does anyone have a suggestion as to which file I should be looking at for some error or indication of corruption?

It wouldn't be something as simple as a changed Host ID, would it?



Click here to join the #1 Aussie Alliance on Climate Prediction
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 21075 - Posted: 5 Mar 2006, 15:28:20 UTC

If the BOINC Manager loses contact with the model for a period of time, it assumes that the model has crashed and downloads a new one.

As long as both models look OK, what I would suggest is:

* Set 'no more work' against the project (this prevents BOINC downloading other models unnecessarily)

* Suspend or abort the newest model (since it's done no work)

I'm a volunteer and my views are my own.
News and Announcements and FAQ
m.mitch
Message 21077 - Posted: 5 Mar 2006, 17:04:16 UTC - in response to Message 21075.  
Last modified: 5 Mar 2006, 17:12:09 UTC

I have the new one suspended, but the old one doesn't show up in BOINC Manager and I can't work out why. I'm stumped. Unless someone knows where the Host ID goes, that's all I can think of.


MikeMarsUK
Message 21078 - Posted: 5 Mar 2006, 22:31:04 UTC - in response to Message 21077.  

There's no evidence of a model failure on the website, but there may be something on your 'messages' tab, or in the log files (StdErrGui.txt, StdErrDae.txt).

There are indeed two computer IDs with active results; I take it you only actually have one in progress?
m.mitch
Message 21095 - Posted: 6 Mar 2006, 7:20:29 UTC - in response to Message 21078.  
Last modified: 6 Mar 2006, 7:24:16 UTC

Yes, I only have the one on that PC. I merged them, but I'm not sure if there is a way to merge the new into the old or, indeed, if that would make any difference. The other PC has very similar specs, but is not shown in the above hyperlink.

This is the last time it contacted the server:
2006-03-04 14:19:28 [---] Resuming round-robin CPU scheduling.
2006-03-04 14:19:28 [climateprediction.net] Resuming result sulphur_iy22_000883946_0 using sulphur_cycle version 422

This is the last entry before the restart & what followed:
2006-03-04 15:10:55 [SETI@home] Started download of better_banner.jpg

To pause/resume tasks hit CTRL-C, to exit hit CTRL-BREAK
2006-03-04 21:07:36 [---] Starting BOINC client version 5.2.13 for windows_intelx86
2006-03-04 21:07:36 [---] libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3
2006-03-04 21:07:36 [---] Data directory: C:\Program Files\BOINC
2006-03-04 21:07:36 [---] Missing open tag in state file.
2006-03-04 21:07:36 [---] Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.20GHz
2006-03-04 21:07:36 [---] Memory: 1023.48 MB physical, 2.40 GB virtual
2006-03-04 21:07:36 [---] Disk: 31.25 GB total, 12.08 GB free
2006-03-04 21:07:36 [---] Version change detected (0.0.0 -> 5.2.13); running CPU benchmarks
2006-03-04 21:07:36 [Einstein@Home] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [Leiden Classical] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [LHC@home] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [SETI@home] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [climateprediction.net] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [uFluids] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [World Community Grid] Computer ID: not assigned yet; location: ; project prefs: default

Just the usual stuff after this. What seems strange is that all the other projects reloaded after I merged them.

I don't know if any of that is going to be of help to you. I can't see anything out of the ordinary except that all the hosts were dropped. I haven't had that happen before. Now that it's summer here, I've had this PC seize up a couple of times. It's not OC'ed and it looks pretty clean, but I may have to reset the heatsink. But that's another story.

Thanks for the help. I'm almost at the point of dropping the work I've done so far and running the new WU, but I'm just having trouble letting go of all those hours :)



MikeMarsUK
Message 21096 - Posted: 6 Mar 2006, 8:07:54 UTC

...
2006-03-04 21:07:36 [---] Missing open tag in state file.
2006-03-04 21:07:36 [---] Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.20GHz
2006-03-04 21:07:36 [---] Memory: 1023.48 MB physical, 2.40 GB virtual
2006-03-04 21:07:36 [---] Disk: 31.25 GB total, 12.08 GB free
2006-03-04 21:07:36 [---] Version change detected (0.0.0 -> 5.2.13); running CPU benchmarks
2006-03-04 21:07:36 [Einstein@Home] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [Leiden Classical] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [LHC@home] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [SETI@home] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [climateprediction.net] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [uFluids] Computer ID: not assigned yet; location: home; project prefs: default
2006-03-04 21:07:36 [World Community Grid] Computer ID: not assigned yet; location: ; project prefs: default
...


It looks like the BOINC files were corrupted somehow, judging from that... I'm not sure if it's possible to restore the model. There is a BOINC wiki page about restoring multi-project setups from backup; while that's not directly relevant, it may offer enough detail on the internal workings of the state file to look through for ideas.
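
The warning signs in the startup log above can also be picked out mechanically. This is a minimal Python sketch, assuming the log line format seen in this thread (date, time, project in square brackets, then the message). The function name find_state_reset_signs and the list of marker phrases are illustrative choices based only on the messages quoted here, not anything defined by BOINC.

```python
import re

# Line format as seen in the log excerpts in this thread:
#   2006-03-04 21:07:36 [---] Missing open tag in state file.
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")

# Phrases from the excerpt that suggest the client started without a usable
# state file. This list is an assumption drawn from the quoted messages only.
MARKERS = (
    "Missing open tag in state file",
    "Version change detected (0.0.0",
    "Computer ID: not assigned yet",
)


def find_state_reset_signs(lines):
    """Return (timestamp, project, message) tuples for suspicious log lines."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a log entry
        timestamp, project, message = match.groups()
        if any(marker in message for marker in MARKERS):
            hits.append((timestamp, project, message))
    return hits


sample = [
    "2006-03-04 21:07:36 [---] Starting BOINC client version 5.2.13 for windows_intelx86",
    "2006-03-04 21:07:36 [---] Missing open tag in state file.",
    "2006-03-04 21:07:36 [climateprediction.net] Computer ID: not assigned yet; location: home; project prefs: default",
]
for hit in find_state_reset_signs(sample):
    print(hit)
```

A hit on any of these lines only suggests that the client came up with a blank state; it does not say which file was damaged or why.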
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 21098 - Posted: 6 Mar 2006, 8:51:43 UTC

Yes, it looks like the client_state.xml file has been corrupted.
Do you have a backup? If so, restore it. Otherwise, you'll have to start again.
That file has numerous sections, all 'signed' (i.e. checksummed) to prevent tampering.
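
Before restoring anything, a rough sanity check of the state file (or of a backup copy) can be sketched in Python as below. It assumes the file is wrapped in a client_state root tag, which is inferred from the file name and the "Missing open tag" message rather than stated in this thread. The function name looks_intact is made up for illustration. The check does not validate the signed sections mentioned above and is no substitute for the client's own parsing.

```python
import tempfile
from pathlib import Path


def looks_intact(path, root_tag="client_state"):
    """Rough check that a state file starts and ends with its root tag.

    root_tag is an assumption about the file layout, not a documented fact
    taken from this thread.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace").strip()
    if not text:
        return False  # an empty file, e.g. after an interrupted write
    return text.startswith(f"<{root_tag}>") and text.endswith(f"</{root_tag}>")


# Demonstration on throwaway files with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.xml"
    good.write_text("<client_state>\n<placeholder/>\n</client_state>\n")
    truncated = Path(tmp) / "truncated.xml"
    truncated.write_text("<client_state>\n<placeholder/>\n")
    print(looks_intact(good), looks_intact(truncated))
```

A file that passes this check can still be damaged in the middle, so a known-good backup remains the safer route.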

m.mitch
Message 21099 - Posted: 6 Mar 2006, 15:05:40 UTC - in response to Message 21098.  
Last modified: 6 Mar 2006, 15:15:09 UTC

I have found the old work unit! Thanks to both MikeMars and Les Bayliss.
I've now gone back to run down the existing non-CPDN work units.
I won't cancel the new CPDN WU until I've successfully recovered from the mess.
I have implemented a scripted backup regime :) Oh, I did have a backup; that's where I got the replacement client_state.xml file. I've "automated" it somewhat.
I'll post back when I've completed the next step, to let you know how successful I've been, or not.

Thanks again, I appreciate your patience.

Mike
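
The script used for the backup regime is not shown in this thread. As a rough idea of what such a script might look like, here is a minimal Python sketch that copies a data directory into a new timestamped folder. The function name backup_data_dir and the folder naming scheme are hypothetical, and the demonstration runs on a throwaway directory rather than a real BOINC installation.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_data_dir(data_dir, backup_root):
    """Copy data_dir into a new timestamped folder under backup_root."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"boinc-backup-{stamp}"
    # copytree creates the target (and missing parents) and raises an error
    # if the target already exists, so an older backup is never overwritten.
    shutil.copytree(Path(data_dir), target)
    return target


# Demonstration on a temporary stand-in for the data directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_data = Path(tmp) / "BOINC"
    fake_data.mkdir()
    (fake_data / "client_state.xml").write_text("<client_state>\n</client_state>\n")
    copy = backup_data_dir(fake_data, Path(tmp) / "backups")
    print(sorted(p.name for p in copy.iterdir()))
```

On the PC in this thread the first argument would be the data directory reported in the startup log (C:\Program Files\BOINC), with a destination of your own choosing. It is probably wise to stop the BOINC client before copying so that the files are not changing mid-copy.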



MikeMarsUK
Message 21100 - Posted: 6 Mar 2006, 15:59:36 UTC

Sounds like there may be a good outcome :-) Didn't look very positive at first.
m.mitch
Message 21126 - Posted: 7 Mar 2006, 13:38:39 UTC - in response to Message 21100.  

It has created some odd results; here's just one example:
7/03/2006 2:20:48 AM|Einstein@Home|ACTIVE_TASKS::restart_tasks(); missing files
7/03/2006 2:20:48 AM|Einstein@Home|Unrecoverable error for result r1_1009.0__165_S4R2a_1 (One or more missing files)

I think Einstein had already crunched it with the other host session. I had heaps of strange error messages, all of which seemed to point to missing bits, so I just let them go. Now all I have to do is wait a couple of days while this host "version" runs down its other project WUs. Then switch back.

When I reinstall the client_state.xml file I'll know if it's all worked out. I'll wait in hope ;-)


m.mitch
Message 21184 - Posted: 10 Mar 2006, 11:49:06 UTC - in response to Message 21126.  

All is well now. I only lost a handful of hours' crunching from the CPDN WU, which I assume restarted from when it was backed up. That's OK, it still saved me about 200 hours :-)

All the other projects are back on track and looking good.

Again, thanks to MikeMars and Les Bayliss.


m.mitch
Message 21212 - Posted: 12 Mar 2006, 2:51:57 UTC - in response to Message 21075.  

I've recovered the old one intact and I'm running that now. Ultimately I aborted the newer one and merged all the phantom computers. That leaves me with two P4 3.2 GHz machines running one CPDN WU each. I've even had a few new trickle-ups and my credits are on the rise.

Thanks for all the help Mike, we made it in the end :-)



old_user17289
Joined: 13 Sep 04
Posts: 228
Credit: 354,979
RAC: 0
Message 21229 - Posted: 13 Mar 2006, 5:39:59 UTC

It's always good to hear that someone managed to recover from a backup!
m.mitch
Message 21230 - Posted: 13 Mar 2006, 5:44:31 UTC - in response to Message 21229.  

I was so surprised I posted the recovery details on our team web site!

