Checkpoints for hadcm3 models

Profile old_user60427
Joined: 4 Mar 05
Posts: 24
Credit: 243,647
RAC: 0
Message 22095 - Posted: 16 Apr 2006, 18:07:47 UTC
Last modified: 16 Apr 2006, 18:08:37 UTC

I know slab models checkpoint every 144 timesteps, so if your model gets halted (system shutdown, or BOINC suspends it and swaps it out of memory), you lose at most 143 timesteps, or ~400 seconds on my systems.

I recently started running a coupled model. I have noted that this, unlike sulphur and slab, only writes to the logfile after 432 timesteps (an improvement). Looking at progress in the logs, I get the impression that this model also only checkpoints every 432 TS -- I always see the same pattern, i.e. the model restarts after the last multiple of 432 TS.

hadcm3lb_59mn_05033739 - PH 1 TS 0019009 A - 25/08/1921 00:30 - H:M:S=0016:25:01 AVG= 3.11 DLT= 1.78
2006-04-16 11:39:03 [climateprediction.net] Pausing result hadcm3lb_59mn_05033739_1 (removed from memory)
2006-04-16 11:39:03 [SETI@home] Restarting result 01se99aa.24541.23808.665908.1.255_2 using setiathome version 470

... many lines removed

2006-04-16 15:39:47 [climateprediction.net] Restarting result hadcm3lb_59mn_05033739_1 using hadcm3lb version 508
2006-04-16 15:39:47 [SETI@home] Pausing result 05mr99aa.17103.30529.79836.1.22_2 (removed from memory)
2006-04-16 15:39:48 [---] request_reschedule_cpus: process exited
Beginning work on result hadcm3lb_59mn_05033739_1...
Starting model in /home/boinc/projects/www.climateprediction.net...
Created shared memory region key = 77310 of size 655036 bytes
.so shmem return code = 0
Starting model ID hadcm3lb_59mn_05033739 Phase 1
Climate model starting - use graphics to monitor progress.
Or visit the website to see the graphs for this run.
hadcm3lb_59mn_05033739 - PH 1 TS 0019009 A - 25/08/1921 00:30 - H:M:S=0016:25:01 AVG= 3.11 DLT= 0.00
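The pattern in the log above can be checked mechanically. The following is only a rough sketch: the field layout is inferred from the status lines quoted in this thread, and the names `parse_status` and `ts_since_logged_multiple` are made up for illustration. Every status line quoted here has a TS value one past a multiple of 432, so the helper measures the distance from that point.

```python
import re

# Field layout is inferred from the status lines quoted in this thread,
# not from any official format description.
STATUS_RE = re.compile(
    r"^(?P<model>\S+)\s+-\s+PH\s+(?P<phase>\d+)\s+TS\s+(?P<ts>\d+)\s+\S+\s+-\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<clock>\d{2}:\d{2})\s+-\s+"
    r"H:M:S=(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})\s+"
    r"AVG=\s*(?P<avg>[\d.]+)\s+DLT=\s*(?P<dlt>[\d.]+)\s*$"
)


def parse_status(line):
    """Return the fields of one model status line, or None if it does not match."""
    m = STATUS_RE.match(line.strip())
    if m is None:
        return None
    cpu_seconds = int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"])
    return {
        "model": m["model"],
        "phase": int(m["phase"]),
        "ts": int(m["ts"]),
        "model_date": m["date"],
        "cpu_seconds": cpu_seconds,
        "avg": float(m["avg"]),
        "dlt": float(m["dlt"]),
    }


def ts_since_logged_multiple(ts, interval=432):
    """Timesteps past the last point of the form k * interval + 1."""
    return (ts - 1) % interval
```

A status line printed right after a restart that gives 0 here, with DLT= 0.00, fits the idea that the model resumed from a checkpoint written at a 432 TS boundary.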

Anybody who knows the details of this?

I cannot enable "Leave applications in memory while preempted?" because I then end up with CPDN and SETI both running at the same time. This is an issue with BOINC clients after 4.19 that I have not found a solution for. I will try the new 5.4.x client once it is available.
Profile astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 22098 - Posted: 16 Apr 2006, 20:19:05 UTC
Last modified: 16 Apr 2006, 21:40:32 UTC

Your analysis is correct. Checkpoints occur every 432 timesteps -- every six Model Days instead of the previous three -- and there are now 72 TS per Model Day rather than 48. (48 Atmosphere, 24 Ocean.)

I run only CPDN, so can't comment about the memory contention with SETI. First I've heard of it, actually. Seems very odd that BOINC would give control to two clients at the same time. (Edit: CPDN should be inactive when suspended, even if occupying a chunk of memory, and vice versa when SETI is inactive. What are the symptoms? What shows in the logs?) BOINC guru, help!

So, you'll lose an average of 216 TS every time CM is suspended and removed from memory. Expensive.
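The arithmetic behind these figures can be written out as a small sketch. The function names are hypothetical, the 3.11 seconds per timestep is simply the AVG value from the log lines posted in this thread, and the "average" figure assumes the interruption is equally likely at any point between two checkpoints.

```python
def model_days_per_checkpoint(checkpoint_ts, ts_per_model_day):
    """Model days covered by one checkpoint interval."""
    return checkpoint_ts / ts_per_model_day


def expected_loss(checkpoint_ts, sec_per_ts):
    """Average work lost per removal from memory.

    Assumes the interruption lands uniformly at random within a
    checkpoint interval, so on average half an interval is lost.
    Returns (timesteps_lost, cpu_seconds_lost).
    """
    ts_lost = checkpoint_ts / 2
    return ts_lost, ts_lost * sec_per_ts


# Coupled model: 432 TS per checkpoint, 72 TS per model day.
coupled_days = model_days_per_checkpoint(432, 72)
# Slab model: 144 TS per checkpoint, 48 TS per model day.
slab_days = model_days_per_checkpoint(144, 48)

# AVG= 3.11 is taken from the poster's log; other machines will differ.
ts_lost, seconds_lost = expected_loss(432, 3.11)
```

On the poster's machine this works out to roughly 11 minutes of CPU time lost per swap-out on average.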

[Edited for typo.]
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
Profile old_user60427
Joined: 4 Mar 05
Posts: 24
Credit: 243,647
RAC: 0
Message 22150 - Posted: 17 Apr 2006, 16:53:18 UTC - in response to Message 22098.  

Thanks for the info. Will see how to work around this. Ideally I get the "Leave applications in memory while preempted?" thingy to work (it worked like a charm with BOINC 4.19.4). I have just switched to a new BOINC version (trux calibrating client based off 5.3.12) and will see whether that swaps the science apps better.

What happens is that both apps stay in RAM (OK; I have got enough), but -- and here's the catch -- both continue to run, so either gets ~45% of CPU, and the context switching between CPDN and SETI screws up any caching that is done by the CPU.

The logs provide no real clue. Here is a chunk showing what I described:

2006-04-10 15:17:11 [climateprediction.net] Resuming result hadcm3lb_59mn_05033739_1 using hadcm3lb version 508
2006-04-10 15:17:11 [SETI@home] Pausing result 05oc02aa.7805.29506.878390.1.78_2 (left in memory)
Resuming CPDN!
hadcm3lb_59mn_05033739 - PH 1 TS 0006913 A - 07/03/1921 00:30 - H:M:S=0005:58:46 AVG= 3.11 DLT= 1.32
2006-04-10 16:17:11 [climateprediction.net] Pausing result hadcm3lb_59mn_05033739_1 (left in memory)
2006-04-10 16:17:11 [SETI@home] Resuming result 05oc02aa.7805.29506.878390.1.78_2 using setiathome version 470

Note that I get only one line from CPDN in one hour. When the application is not kept in RAM, a switch looks like:

2006-04-12 13:01:41 [SETI@home] Computation for result 25se02aa.7060.496.228398.1.105_3 finished
2006-04-12 13:01:41 [climateprediction.net] Resuming result hadcm3lb_59mn_05033739_1 using hadcm3lb version 508
Resuming CPDN!
hadcm3lb_59mn_05033739 - PH 1 TS 0008641 A - 01/04/1921 00:30 - H:M:S=0007:27:34 AVG= 3.11 DLT= 1.43
hadcm3lb_59mn_05033739 - PH 1 TS 0009073 A - 07/04/1921 00:30 - H:M:S=0007:49:35 AVG= 3.11 DLT= 1.59
hadcm3lb_59mn_05033739 - PH 1 TS 0009505 A - 13/04/1921 00:30 - H:M:S=0008:11:33 AVG= 3.10 DLT= 1.37
2006-04-12 14:01:42 [climateprediction.net] Pausing result hadcm3lb_59mn_05033739_1 (removed from memory)
2006-04-12 14:01:42 [SETI@home] Starting result 25se02aa.7060.1265.129826.1.176_0 using setiathome version 470
Cleaning up graphics data...
Detaching shared memory...
2006-04-12 14:01:44 [---] request_reschedule_cpus: process exited

I complete three times as much ... Will try once the new BOINC version is running stable.
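As a rough cross-check of the two logs, the number of status lines per wall-clock hour can be estimated from the CPU share. This is only a sketch: `status_lines_per_hour` is a hypothetical helper, and it assumes one status line per 432 TS and the 3.11 seconds per timestep shown as AVG in the logs.

```python
def status_lines_per_hour(cpu_share, sec_per_ts=3.11, ts_per_line=432,
                          wall_seconds=3600):
    """Estimated status lines written in one wall-clock hour.

    cpu_share is the fraction of one CPU the model actually gets
    (1.0 when running alone, about 0.45 when sharing with SETI).
    """
    timesteps = cpu_share * wall_seconds / sec_per_ts
    return timesteps / ts_per_line


shared = status_lines_per_hour(0.45)   # both apps running at once
alone = status_lines_per_hour(1.0)     # CPDN has the CPU to itself
```

The estimate gives a bit over one line per hour when sharing and between two and three when running alone, which is in line with the one line versus three lines seen above. CPU share alone does not quite account for a factor of three; the rest may be the cache effect mentioned earlier, or just where the hour boundaries happen to fall.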