Batch 996 Weather@Home2 East Asia25

wateroakley

Send message
Joined: 6 Aug 04
Posts: 195
Credit: 28,254,591
RAC: 10,553
Message 69942 - Posted: 19 Oct 2023, 8:17:15 UTC

The five recalcitrant zips uploaded earlier today.
ID: 69942 · Report as offensive     Reply Quote
Yeti

Send message
Joined: 5 Aug 04
Posts: 178
Credit: 18,474,913
RAC: 64,888
Message 69943 - Posted: 19 Oct 2023, 9:03:08 UTC - in response to Message 69941.  

Uploads complete!

+1
Supporting BOINC, a great concept !
ID: 69943 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4532
Credit: 18,836,565
RAC: 21,339
Message 69944 - Posted: 19 Oct 2023, 9:08:33 UTC

And the number of hosts reporting completed tasks in the last 24 hours has doubled since I looked earlier this morning. My last two should finish shortly.
ID: 69944 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1046
Credit: 16,316,506
RAC: 16,122
Message 69945 - Posted: 19 Oct 2023, 10:19:01 UTC - in response to Message 69937.  
Last modified: 19 Oct 2023, 10:19:45 UTC

The big jump in the number of users reporting tasks is I think evidence that switching to Jasmine has worked.
No, the switch to JASMIN hasn't happened -- CPDN are looking into moving the Korean machine outside the firewall first, as that would be easier for the scientists.
ID: 69945 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4532
Credit: 18,836,565
RAC: 21,339
Message 69946 - Posted: 19 Oct 2023, 11:13:21 UTC - in response to Message 69945.  

No, the switch to JASMIN hasn't happened -- CPDN are looking into moving the Korean machine outside the firewall first, as that would be easier for the scientists.
Something has changed if the stuck uploads have gone through. The increase in the number of machines reporting could, I suppose, be due to slower machines now finishing tasks. My two in the VM have just reported. They take about 20% longer than those using WINE. With the next batch, or sooner if I get a resend and catch it in time, I shall try running a task under both systems to see what differences there are.
ID: 69946 · Report as offensive     Reply Quote
zombie67 [MM]
Avatar

Send message
Joined: 2 Oct 06
Posts: 54
Credit: 27,309,613
RAC: 28,128
Message 69947 - Posted: 19 Oct 2023, 12:06:52 UTC

My 40+ uploads finally went through overnight. First time the transfers tab has been empty in weeks.
ID: 69947 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1046
Credit: 16,316,506
RAC: 16,122
Message 69948 - Posted: 19 Oct 2023, 12:38:33 UTC - in response to Message 69946.  

No, the switch to JASMIN hasn't happened -- CPDN are looking into moving the Korean machine outside the firewall first, as that would be easier for the scientists.
Something has changed if the stuck uploads have gone through. The increase in the number of machines reporting could, I suppose, be due to slower machines now finishing tasks. My two in the VM have just reported. They take about 20% longer than those using WINE. With the next batch, or sooner if I get a resend and catch it in time, I shall try running a task under both systems to see what differences there are.
It's possible something has changed at the Korean side that I'm not aware of. I will ask & report back. I know there was a high-level exchange of emails yesterday.

Anyway, whatever's happened, I'm glad!
ID: 69948 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4532
Credit: 18,836,565
RAC: 21,339
Message 69949 - Posted: 19 Oct 2023, 12:51:35 UTC - in response to Message 69948.  

Anyway, whatever's happened, I'm glad!
Agreed! Knowing what, if anything, has changed is mainly to satisfy my curiosity, and secondly to have ideas in case it happens again.
ID: 69949 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1046
Credit: 16,316,506
RAC: 16,122
Message 69950 - Posted: 19 Oct 2023, 14:02:43 UTC

Confirmation from the Korean site. Their IT staff have opened up the HTTP port on their firewall -- effectively they're temporarily disabling protection against DDoS.
---
CPDN Visiting Scientist
ID: 69950 · Report as offensive     Reply Quote
ChelseaOilman

Send message
Joined: 24 Dec 19
Posts: 32
Credit: 40,631,151
RAC: 113,609
Message 69951 - Posted: 19 Oct 2023, 16:55:16 UTC

Even though the server status page always shows no available tasks it seems there are a few available. Built a new Ryzen 7950X system yesterday and got BOINC running on it last night. Checked it this morning and it was crunching 2 CPDN tasks.

Uploads still going great for me. No tasks waiting to upload.
ID: 69951 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4532
Credit: 18,836,565
RAC: 21,339
Message 69952 - Posted: 19 Oct 2023, 17:17:31 UTC - in response to Message 69951.  

Even though the server status page always shows no available tasks it seems there are a few available. Built a new Ryzen 7950X system yesterday and got BOINC running on it last night. Checked it this morning and it was crunching 2 CPDN tasks.
There will be the odd retread that has failed on its first and possibly second attempt for a while yet. I see I picked up two a few minutes ago.
ID: 69952 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1046
Credit: 16,316,506
RAC: 16,122
Message 69954 - Posted: 19 Oct 2023, 20:51:24 UTC - in response to Message 69952.  

Because the model is flaky for this batch region when restarting (e.g. power off/on), we are losing a lot of the 1st & 2nd attempts. That's why we're getting more resends than normal. I am sure a lot of the hard fails are simply due to this and not because of an inherent problem with the model perturbations. Not sure yet whether CPDN will decide to rerun them.
ID: 69954 · Report as offensive     Reply Quote
rob

Send message
Joined: 5 Jun 09
Posts: 97
Credit: 3,713,662
RAC: 5,691
Message 69955 - Posted: 19 Oct 2023, 20:53:30 UTC - in response to Message 69954.  

Hopefully, if they do, someone will look for the root cause of the poor restart performance of these tasks.
ID: 69955 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4532
Credit: 18,836,565
RAC: 21,339
Message 69956 - Posted: 20 Oct 2023, 5:26:14 UTC

Two _1 tasks running here. I have the same tasks running both under WINE and in Windows in a VM, which hopefully will enable some comparisons to be made between the output files. Network activity is turned off for the WINE install of BOINC. In fact, network activity is off for both, so the zips don't upload from the Windows install before I get a chance to look at the files.

What I am not sure of, Glenn, is whether, even if the science data is the same between both runs, tasks that complete under WINE but not under Windows are still invalid, or even whether there is a way of checking that.
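
For anyone wanting to try a similar comparison of output files, here is a minimal Python sketch, assuming the two runs produce result zips with matching member names. It hashes every member of both archives and reports which files are missing from one side or differ in content. The function names are hypothetical, and byte-identical output is a stricter test than scientific equivalence, so a reported difference does not by itself mean the science data disagrees.

```python
import hashlib
import zipfile


def member_hashes(zip_path):
    """Map each file inside the zip to the SHA-256 of its uncompressed bytes."""
    hashes = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no data to compare
            hashes[info.filename] = hashlib.sha256(archive.read(info)).hexdigest()
    return hashes


def compare_zips(path_a, path_b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of member names."""
    a = member_hashes(path_a)
    b = member_hashes(path_b)
    only_in_a = sorted(set(a) - set(b))
    only_in_b = sorted(set(b) - set(a))
    differing = sorted(name for name in set(a) & set(b) if a[name] != b[name])
    return only_in_a, only_in_b, differing
```

Calling `compare_zips` on the zip from the WINE run and the matching zip from the VM run would return three lists: members found only in the first archive, members found only in the second, and members present in both whose bytes differ.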
ID: 69956 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,087
RAC: 2,202
Message 69957 - Posted: 20 Oct 2023, 5:50:58 UTC - in response to Message 69952.  

Do you want to see this? It failed pretty fast on a Windows 10 box.

Task 22347812
Name 	wah2_eas25_a11q_199112_24_996_012224906_2
Workunit 	12224906
Created 	16 Oct 2023, 23:44:03 UTC
Sent 	16 Oct 2023, 23:44:37 UTC
Report deadline 	28 Oct 2024, 5:04:37 UTC
Received 	17 Oct 2023, 0:45:18 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	0 (0x00000000)
Computer ID 	1512658
Run time 	2 min 41 sec
CPU time 	2 min 23 sec
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	4.23 GFLOPS
Application version 	Weather At Home 2 (wah2) v8.24 windows_intelx86
Peak working set size 	166.88 MB
Peak swap size 	160.23 MB
Peak disk usage 	0.01 MB
Stderr 	

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<stderr_txt>
Signal 11 received: Segment violation
Signal 11 received: Software termination signal from kill 
Signal 11 received: Abnormal termination triggered by abort call
Signal 11 received, exiting...
19:47:52 (7736): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=2932, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=7736, selfPID=13976, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
19:47:56 (13976): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_1.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_2.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_3.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_4.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_5.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_6.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_7.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_8.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_9.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_10.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_11.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_12.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_13.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_14.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_15.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_16.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_17.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_18.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_19.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_20.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_21.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_22.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_23.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_24.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_eas25_a11q_199112_24_996_012224906_2_r1197333757_restart.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
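
A log like the one above can be summarised rather than read line by line. The following is a rough Python sketch, assuming the stderr text has been saved into a string. It uses a regular expression instead of an XML parser because the `<message>` body mixes free text with several sibling `<file_xfer_error>` elements. The function name is made up for illustration.

```python
import re
from collections import Counter

# Matches one <file_xfer_error> block and captures the file name and error code text.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>[^<]+)</error_code>\s*"
    r"</file_xfer_error>"
)


def summarise_xfer_errors(stderr_text):
    """Return (list of failed file names, Counter of error code strings)."""
    names = []
    codes = Counter()
    for match in XFER_ERROR.finditer(stderr_text):
        names.append(match.group("name").strip())
        codes[match.group("code").strip()] += 1
    return names, codes
```

Applied to the log above, this would list 25 file names (the 24 numbered zips plus the restart zip), all with the same `-240 (stat() failed)` code, which would fit the model having crashed before any of those files were written.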

ID: 69957 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4532
Credit: 18,836,565
RAC: 21,339
Message 69961 - Posted: 20 Oct 2023, 9:45:42 UTC - in response to Message 69957.  

Do you want to see this? It failed pretty fast on a Windows 10 box.
That would be the swapping from global to regional models at the end of the first model day. Sadly, I don't think data from crunchers' machines is likely to help isolate what is happening there. It is proving difficult enough to track down on in-house machines where there is access to the code.
ID: 69961 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1046
Credit: 16,316,506
RAC: 16,122
Message 69962 - Posted: 20 Oct 2023, 10:08:05 UTC - in response to Message 69961.  
Last modified: 20 Oct 2023, 10:15:40 UTC

Do you want to see this? It failed pretty fast on a Windows 10 box.
That would be the swapping from global to regional models at the end of the first model day. Sadly, I don't think data from crunchers' machines is likely to help isolate what is happening there. It is proving difficult enough to track down on in-house machines where there is access to the code.
I have the model compiled under Linux and am currently debugging what's going on. Thanks for the offer, but I am well past that point.

It's restarting the model from a shutdown that risks the model failing like this.
ID: 69962 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1046
Credit: 16,316,506
RAC: 16,122
Message 69963 - Posted: 20 Oct 2023, 10:10:03 UTC - in response to Message 69956.  

Two _1 tasks running here. I have the same tasks running both under WINE and in Windows in a VM, which hopefully will enable some comparisons to be made between the output files. Network activity is turned off for the WINE install of BOINC. In fact, network activity is off for both, so the zips don't upload from the Windows install before I get a chance to look at the files.

What I am not sure of, Glenn, is whether, even if the science data is the same between both runs, tasks that complete under WINE but not under Windows are still invalid, or even whether there is a way of checking that.
What do you mean by 'invalid'?
ID: 69963 · Report as offensive     Reply Quote
Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4532
Credit: 18,836,565
RAC: 21,339
Message 69964 - Posted: 20 Oct 2023, 10:20:42 UTC - in response to Message 69963.  

Two _1 tasks running here. I have the same tasks running both under WINE and in Windows in a VM, which hopefully will enable some comparisons to be made between the output files. Network activity is turned off for the WINE install of BOINC. In fact, network activity is off for both, so the zips don't upload from the Windows install before I get a chance to look at the files.

What I am not sure of, Glenn, is whether, even if the science data is the same between both runs, tasks that complete under WINE but not under Windows are still invalid, or even whether there is a way of checking that.
What do you mean by 'invalid'?


Not invalid as in rejected by the software but invalid as in useless for the science.
ID: 69964 · Report as offensive     Reply Quote
rob

Send message
Joined: 5 Jun 09
Posts: 97
Credit: 3,713,662
RAC: 5,691
Message 69965 - Posted: 20 Oct 2023, 10:25:09 UTC - in response to Message 69962.  

It's restarting the model from a shutdown that risks the model failing like this.

None of my "two minute crashes" have been the result of a restart after a shutdown.
ID: 69965 · Report as offensive     Reply Quote

©2024 cpdn.org