climateprediction.net (CPDN) home page
Thread 'Failing tasks with exit code 12 and 25'

Questions and Answers : Unix/Linux : Failing tasks with exit code 12 and 25

LinAGKar

Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62421 - Posted: 11 May 2020, 7:15:22 UTC

I have a bunch of tasks on one of my computers which failed with exit code 12 or 25.

On ones with exit code 12 I see an error like:
checkdir:  cannot create extraction directory: hadam4h_a21t_209911_4_867_012014556
           File exists

On ones with exit code 25 I see a bunch of errors like:
Could not read directory attributes: Value too large for defined data type

or
checkdir error:  cannot create hadam4h_a0wt_209411_4_868_012016230/datain/ancil/ctldata
                 File exists
                 unable to process datain/ancil/ctldata/stasets/.
ID: 62421

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 62422 - Posted: 11 May 2020, 16:34:45 UTC

After looking through several of the tasks, I found a few others failing with similar, though not identical, messages to yours. It took a while because most of the failures were due to missing 32-bit libraries. The fact that others also failed with similar errors suggests a problem with the tasks. I will let the project know.
ID: 62422

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 62423 - Posted: 11 May 2020, 20:37:23 UTC

I have seen this error before, when looking at tasks that failed for other reasons.

I am clutching at straws a bit here, but there are a few things worth checking.

1. That you have enough disk space allocated. (Unlikely to be a problem with only 8 cores.)

2. Something to do with RAM and/or cache memory. If the tasks complete fine when you restrict BOINC to only 4 cores at a time, then cache memory would be the most likely reason.

Sarah at the project replied to my post and thinks the strange error messages you are getting are likely not directly from the crash but because something doesn't clean up properly after the crash.

3. Just thought of this: it could be that they are crashing because you have downloaded a corrupted file. If you detach from CPDN and then re-attach, that will download fresh copies of all the relevant files, which should resolve the problem if a corrupted file is the cause. (Might be worth trying that one first.)

I don't think it is that common an issue, as I have never seen it on my own boxes and have only rarely seen it when looking through crashed tasks for patterns.
ID: 62423

LinAGKar
Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62424 - Posted: 12 May 2020, 12:41:02 UTC - in response to Message 62423.  

1. BOINC is using 6.7 GB, and it says it has another 125.46 GB available.
2. I have already restricted BOINC to 1 core.
3. I have now detached and reattached the project, but it says communication is deferred for 1 day, so we'll have to wait and see what happens.
ID: 62424

LinAGKar
Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62437 - Posted: 19 May 2020, 9:44:35 UTC - in response to Message 62423.  

I'm still getting the same errors after detaching and reattaching.
ID: 62437

LinAGKar
Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62441 - Posted: 19 May 2020, 17:20:40 UTC

I have an idea about what is causing this. The computer getting these errors has the BOINC directory on XFS, which uses 64-bit inode numbers, but CPDN seems to be 32-bit, and by default in 32-bit applications the stat() and readdir() functions use 32-bit inode numbers, hence the:

Could not read directory attributes: Value too large for defined data type


To fix this, CPDN needs to be compiled with _FILE_OFFSET_BITS=64 or use stat64 and readdir64; or even better, compiled as 64-bit. See https://www.mjr19.org.uk/sw/inodes64.html for a longer explanation.
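
To illustrate the first option, here is a minimal sketch (my own, not anything from the CPDN source) of what building with _FILE_OFFSET_BITS=64 changes. The helper name inode_of is hypothetical. The assumption is a glibc-style C library, where defining the macro before any system header makes ino_t 64 bits wide and maps stat() onto its 64-bit variant, so an inode number above 2^32 no longer triggers EOVERFLOW ("Value too large for defined data type").

```c
/* Must come before any system header so the C library selects the
 * 64-bit variants of stat(), readdir(), off_t and ino_t. */
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* With the macro above, ino_t is expected to be 64 bits wide even when
 * the program is compiled as 32-bit (for example with gcc -m32). */
static_assert(sizeof(ino_t) == 8,
              "ino_t should be 64-bit with _FILE_OFFSET_BITS=64");

/*
 * Hypothetical helper: store the inode number of `path` in *ino_out.
 * Returns 0 on success, or the errno value reported by stat() on failure.
 * Without _FILE_OFFSET_BITS=64, a 32-bit build would get EOVERFLOW here
 * for any file whose inode number does not fit in 32 bits.
 */
int inode_of(const char *path, uint64_t *ino_out)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return errno;

    *ino_out = (uint64_t)sb.st_ino;
    return 0;
}
```

The same effect can usually be had without touching the source by passing -D_FILE_OFFSET_BITS=64 on the compiler command line.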
ID: 62441

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62442 - Posted: 19 May 2020, 18:21:10 UTC - in response to Message 62441.  

The Met Office programs used by the researchers are all 32-bit code, and will stay that way for historical comparison of data.

It's up to users to make sure that their computers have the necessary 32-bit libraries.

See '*** Running 32bit CPDN from 64bit Linux - Important ***' at the top of this Linux section for how to do this for various Linux versions.
ID: 62442

LinAGKar
Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62443 - Posted: 19 May 2020, 19:31:16 UTC - in response to Message 62442.  

The problem here has nothing to do with missing libraries. The problem is that XFS uses 64-bit inode numbers, so the stat and readdir system calls return 64-bit inode numbers, but hadam4_8.52_i686-pc-linux-gnu uses the old stat and readdir functions, which only work for 32-bit inodes. It's not something you can fix just by installing extra dependencies.

If it can't be 64-bit (and unless it's poorly written, that shouldn't change the data) then that leaves the other two workarounds I mentioned:


    Compile CPDN with _FILE_OFFSET_BITS=64. Unless the inode numbers are actually used for anything, this should not change anything else.
    Replace calls to stat and readdir with stat64 and readdir64.



There is also the LD_PRELOAD trick mentioned there. I'll try it and see how that works out, though there doesn't seem to be a way to apply it to just one project, so I'll need to run all of boinc-client with it.
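
For the second workaround, a rough sketch of what calling the explicit 64-bit interfaces looks like. This is an illustration under assumptions, not the CPDN code: the function names inode_of64 and count_entries64 are made up, and it assumes a glibc-style library where _LARGEFILE64_SOURCE exposes stat64, readdir64 and struct dirent64.

```c
/* Exposes the explicit 64-bit interfaces (stat64, readdir64, ...). */
#define _LARGEFILE64_SOURCE

#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: like stat(), but uses stat64() so the inode
 * number is always returned in a 64-bit field.
 * Returns 0 on success, or the errno value on failure.
 */
int inode_of64(const char *path, uint64_t *ino_out)
{
    struct stat64 sb;

    if (stat64(path, &sb) != 0)
        return errno;

    *ino_out = (uint64_t)sb.st_ino;
    return 0;
}

/*
 * Hypothetical helper: count the entries in `dir` using readdir64().
 * If wide_out is not NULL, it receives the number of entries whose
 * inode number does not fit in 32 bits.
 * Returns the total entry count, or -1 if the directory cannot be opened.
 * For brevity, a read error in the middle of the listing is treated the
 * same as end-of-directory.
 */
long count_entries64(const char *dir, long *wide_out)
{
    assert(dir != NULL);

    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    long total = 0;
    long wide = 0;
    struct dirent64 *entry;

    while ((entry = readdir64(d)) != NULL) {
        total++;
        if ((uint64_t)entry->d_ino > UINT32_MAX)
            wide++;
    }
    closedir(d);

    if (wide_out != NULL)
        *wide_out = wide;
    return total;
}
```

Compared with the _FILE_OFFSET_BITS=64 route, this needs every call site to be edited by hand, which is probably why the macro is the more common choice.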

ID: 62443

LinAGKar
Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62444 - Posted: 19 May 2020, 22:05:08 UTC - in response to Message 62443.  

Or you can put it in VirtualBox, as some other projects have done, which should avoid both file system and library issues.
ID: 62444

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 62445 - Posted: 20 May 2020, 7:11:34 UTC - in response to Message 62444.  
Last modified: 20 May 2020, 7:26:42 UTC

Or you can put it in VirtualBox, as some other projects have done, which should avoid both file system and library issues.


VirtualBox would, as you say, solve the problem of systems missing the 32-bit libraries. However, there would be some performance hit, and from time to time over on the BOINC boards I see users who have problems with it, so it adds another layer where problems might occur. I also don't know how straightforward it would be for the people at the project to set up VirtualBox for the Linux applications, or whether anyone there has experience of doing so.

For other reasons, I am going to do a clean install of Ubuntu on my laptop when the work currently on it is finished, and I will try using XFS to test this. It is not a fast machine, though, so it is likely to be over a month until I do so.

If there is anyone here using XFS who is either running tasks successfully or has the same problem, it would be good if you could post to help us sort this one out and either confirm or disprove that XFS is the root of the problem.

Edit: If the XFS file system does prevent things working, I am a bit surprised that nothing came up when I searched the BOINC forums.
ID: 62445

LinAGKar
Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62446 - Posted: 20 May 2020, 9:50:42 UTC - in response to Message 62445.  

Keep in mind that 64-bit inodes are only used for file systems bigger than 1 TiB, and AFAIK only inodes that are not in the first 1 TiB of the drive will be too large to fit in 32 bits. So the problem likely won't happen on a near-empty file system, and it will never happen on file systems smaller than 1 TiB. Testing it may require filling the file system with 1 TiB of data first.

I'm not sure how common large XFS file systems are, especially for /var/lib, where AFAIK the BOINC directory is by default on most distros. So this could be a pretty rare issue. But based on your first comment, it does seem to happen sometimes.
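
For anyone wanting to check whether their own BOINC directory is affected, something along these lines could be used. It is only a sketch: the function names are made up, and it walks a directory tree with nftw() and reports every file whose inode number is too large for 32 bits. It has to be built with 64-bit inode support itself (hence the _FILE_OFFSET_BITS define), otherwise it would hit the same error it is looking for.

```c
#define _XOPEN_SOURCE 700      /* for nftw() */
#define _FILE_OFFSET_BITS 64   /* so st_ino is 64 bits wide in 32-bit builds */

#include <assert.h>
#include <ftw.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 when an inode number cannot be represented in 32 bits. */
int exceeds_32_bits(uint64_t ino)
{
    return ino > UINT32_MAX;
}

/* Running total for the current walk; nftw() has no user-data argument. */
static long wide_count;

/* Callback invoked by nftw() for every object in the tree. */
static int visit(const char *path, const struct stat *sb,
                 int type, struct FTW *ftw)
{
    (void)ftw;

    if (type == FTW_NS)        /* stat failed, sb is not valid */
        return 0;

    if (exceeds_32_bits((uint64_t)sb->st_ino)) {
        wide_count++;
        printf("%s: inode %llu\n", path, (unsigned long long)sb->st_ino);
    }
    return 0;                  /* keep walking */
}

/*
 * Hypothetical helper: walk `root` without following symlinks and return
 * the number of objects with inode numbers above 2^32 - 1, or -1 if the
 * walk could not be started.
 */
long count_wide_inodes(const char *root)
{
    assert(root != NULL);

    wide_count = 0;
    if (nftw(root, visit, 16, FTW_PHYS) != 0)
        return -1;
    return wide_count;
}
```

Calling count_wide_inodes() on the BOINC data directory (wherever your distro puts it) and getting a non-zero result would suggest that 32-bit applications without large-inode support can run into this error there.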
ID: 62446

LinAGKar
Joined: 9 Nov 15
Posts: 8
Credit: 310,778
RAC: 0
Message 62447 - Posted: 20 May 2020, 18:49:07 UTC - in response to Message 62446.  

LD_PRELOAD seems to work. What I did was:


    Compile inode64.c from the link above based on the instructions in that file.
    Place it at /usr/local/lib/inode64.so
    Add LD_PRELOAD=/usr/local/lib/inode64.so to /etc/sysconfig/boinc-client (EnvironmentFile in the boinc-client systemd service points to this).



Now, CPDN is running nicely.

ID: 62447

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4540
Credit: 19,039,635
RAC: 18,944
Message 62448 - Posted: 20 May 2020, 20:07:46 UTC - in response to Message 62447.  

Thanks for posting a solution.

I got confirmation from Richard over on the BOINC forums that this almost certainly was a problem with the CPDN setup. I have informed the project to see if someone knows how to fix it at their end.

I worked out after I last posted that, with my system disk only being a 500MB SSD and my data disk being a 1GB mechanical drive, I probably wouldn't see the problem. I will post your solution over on the BOINC forums in case anyone who reads them needs it.
ID: 62448

©2024 cpdn.org