Thread 'Processor specific optimization?'

Questions and Answers : Wish list : Processor specific optimization?

old_user65201

Joined: 20 Mar 05
Posts: 1
Credit: 161,028
RAC: 0
Message 15043 - Posted: 11 Aug 2005, 14:07:55 UTC

hello,

I couldn't find any information on this anywhere, so I decided to post this question:

Is the HADSM client 100% Fortran code? If so, is there any room for processor specific optimizations such as SSE2 SIMD instructions? Or are such optimizations perhaps already present in the code? I imagine that this might help in chipping off as much as 30% of the many days required to process a workunit.

Best Regards,

M Rietveld
ID: 15043
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 15044 - Posted: 11 Aug 2005, 15:10:00 UTC

There are two parts to the code, which you can see in Task Manager. One is Fortran, the other C.
The Fortran is over 50 megabytes, and over a million lines of code.
The original was written by many researchers over many years, and runs on the Met Office's 64-bit supercomputers.
According to Carl, one of this project's programmers, it took a couple of years to modify it to work on desktops.
I think that most of the optimising you mention would be in the compiler, and that the two programmers have got the code tweaked as much as possible.
They continue to work on it to fix problems, but most of these now are to do with the BOINC part.

There is some unintentional processor specific optimising, caused by the Intel compiler.
See this thread for info: http://www.climateprediction.net/board/viewtopic.php?t=3082

ID: 15044
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 15047 - Posted: 11 Aug 2005, 15:56:40 UTC

The SSE and SSE2 optimizations are being used on Intel processors, but the version of the Intel Fortran compiler used to compile hadsm3 disables SSE-type optimizations if it detects a non-Intel CPU running the compiled program. The project now has a later version of the Intel Fortran compiler, which does not hobble the AMD chips so much. Later versions of hadsm will no doubt have optimizations for AMD chips, but when this will happen is unknown.
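
Roughly speaking, the dispatcher compiled into the binary reads the CPUID vendor string at startup and only takes the fast path on "GenuineIntel". A minimal C sketch of that kind of check (illustrative only; it uses GCC's <cpuid.h> and is not the Intel compiler's actual dispatcher code):

    /* Sketch of a CPUID vendor-string check, the kind of test the
       Intel compiler's runtime dispatcher performs. Illustrative
       only; requires GCC on x86. */
    #include <stdio.h>
    #include <string.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        char vendor[13];

        __get_cpuid(0, &eax, &ebx, &ecx, &edx);  /* leaf 0: vendor ID */
        memcpy(vendor,     &ebx, 4);             /* "Genu"            */
        memcpy(vendor + 4, &edx, 4);             /* "ineI"            */
        memcpy(vendor + 8, &ecx, 4);             /* "ntel"            */
        vendor[12] = '\0';

        /* Simplified dispatch: fast path only if the vendor is Intel. */
        if (strcmp(vendor, "GenuineIntel") == 0)
            puts("SSE2 code path selected");
        else
            puts("generic (unoptimized) code path selected");
        return 0;
    }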
ID: 15047
staffann

Joined: 23 Oct 05
Posts: 22
Credit: 526,746
RAC: 0
Message 16979 - Posted: 4 Nov 2005, 21:47:43 UTC - in response to Message 15047.  

The SSE and SSE2 optimizations are being used on Intel processors, but the version of the Intel Fortran compiler used to compile hadsm3 disables SSE-type optimizations if it detects a non-Intel CPU running the compiled program. The project now has a later version of the Intel Fortran compiler, which does not hobble the AMD chips so much. Later versions of hadsm will no doubt have optimizations for AMD chips, but when this will happen is unknown.


I hope optimisation for AMD processors happens soon! For seti@home, my Athlon X2 3800+ halved the processing time for a WU when I downloaded optimised code. Crippling AMD chips this way is really not the way to move forward!
ID: 16979
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 16989 - Posted: 5 Nov 2005, 23:32:07 UTC

Probably not for a few months, until experiment 2 is underway.
And it won't be a special version just for AMD computers, but a general version that somehow gets around the built-in Intel discrimination.

ID: 16989
old_user156196

Joined: 1 Feb 06
Posts: 2
Credit: 46,630
RAC: 0
Message 20091 - Posted: 10 Feb 2006, 12:03:27 UTC - in response to Message 15044.  

Hello Les,

I joined CPDN a few days ago, and it annoys me to no end how long a work unit takes and how much processing power is wasted by the way it was programmed. (OK, the last part is a guess and may be wrong.)

You confirm in that post what I feared - using a supercomputer program on a desktop. That's like trying to pull a 40-ton truck trailer with a VW Beetle.

FORTRAN is certainly way faster than C. Still, I do not understand why you do not release optimized code for AMD- and Intel-based as well as 32-bit and 64-bit CPUs.
After all, this would greatly improve the return rate. Still, I think that's not the biggest point for optimization.

50 MB of code? Yes, I saw the process is that big, but that is exactly what does not work well on a desktop CPU. I do not know the memory size, access speed and bandwidth of your supercomputer, but my Athlon 64 3000+ @ 1.8 GHz does not have 50 MB of (even L2) cache. This means that in most cases it will go to RAM, which by my calculations is at least 15 cycles from the CPU (probably more - up to 300, we learned). That's a lot of waiting time if you ask me.

Then there are those constant HDD accesses. Does the program read or write, and does it have to wait for the access to finish? I have 1 GB of RAM, but with such huge programs I would not be surprised if at least part of it was moved to the page file. So I tried switching off the paging file, but the HDD accesses were still there.
Is this really necessary? I mean, my HDD has an 8 ms average access time, which for my system means 14,400,000 CPU cycles of waiting time.
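
A quick back-of-envelope check of that figure in C, using only the numbers above:

    /* cycles wasted per disk access = access time (s) x clock (Hz) */
    #include <stdio.h>

    int main(void)
    {
        double access_s = 0.008;   /* 8 ms average HDD access time */
        double clock_hz = 1.8e9;   /* Athlon 64 3000+ @ 1.8 GHz    */
        printf("cycles per access: %.0f\n", access_s * clock_hz);
        /* prints 14400000 */
        return 0;
    }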

Here are my questions/suggestions.

1. Is it really necessary to have all 5 phases done in one WU?
If all phases took the same processing time, that would cut each unit to 10 days instead of 50 on my machine, without any other optimization.
I assume that not all 50 MB are used in each phase, so cutting it into parts would make the process smaller and less likely to be paged out to disk.

2. Does each phase have to cover the whole time period? Would it not be possible to calculate just a year, or even just a month (so a work unit would take no more than about 5 hours, hopefully less)? The result would be sent out again to be processed further until the phase is finished, and then again until all phases are finished. If you could cut it into small functional phases, so that the program code and its data would mostly stay in cache, this would speed up the whole processing immensely.
I also believe a lot of people are scared away by the long processing time. While to a supercomputer user this may seem fine, it looks just horrible to a desktop user. So with smaller WUs you would likely get more people contributing to your project.

3. Is the animation part of the 50 MB? My guess is it's the other 7 MB of C code, but why is it in memory when I run no animation? I would like to help research, but I do not really understand it, so to me it's just a nice animation - something I can live without. But even if one wants it, it should not be part of the model. I'll just say: Model-View-Controller design pattern.
Do not load what is not needed at a certain processing stage. Make the whole thing more modular so the OS can work with it better.

4. I wrote to nVidia asking whether they are aware of BOINC, and pointed out that their GPUs are quite well suited to the kind of problems research applications present. A GPU may be up to 10 times as fast as a CPU given the right tasks.
Most users do not play games all the time, but still have a fast 3D GPU in their system, so that GPU is mostly doing nearly nothing. Maybe it would be possible to write graphics card drivers that would allow BOINC and its applications to run during the GPU's idle time as well as the CPU's idle time. GPUs should be good at vector processing, and I guess you use vectors.
Here is the answer I got:

Hi Holly,

I will pass this along. Good idea.

You should lob it in to this group, too.
http://www.gpgpu.org/

Brian Burke
NVIDIA Corp.
12331 Riata Trace Parkway, Suite 300
Austin, TX 78727

Well, I have no idea whether they will decide that the possible costs are worth the possible PR effect. So it's not clear if they will actually try to make it work, but it's cool that they even consider it.
Maybe you want to see if you could tap into those resources as well.

Ok, that was more than enough, I guess.
I would like to hear your answers. What is possible? Is it mainly a question of financing such changes? (That would not surprise me, sadly.)

One thing I want to say at the end: it's great that you work on this and show how the climate is developing. Hopefully governments will see that they need to act now, and hopefully it's not too late already. Germany is not that bad at trying to be gentle on the environment, but when I heard Sweden's commitment yesterday, I have to say we are light years behind, since money is still the stronger drive. Let's not even think about the USA's point of view. But since you live there, you know better than I do ;)

So please speed up your research and show what the consequences are.

All the best

Holly

ID: 20091
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 20101 - Posted: 10 Feb 2006, 16:49:51 UTC
Last modified: 10 Feb 2006, 16:55:50 UTC

When I said that the code was over 50 MB, I meant the source code. Sorry.
The compiled code that runs on PCs is smaller.

I'm not part of the core team, just a user like you who helps out with problems.
This project is run by Oxford University's Atmospheric, Oceanic & Planetary Physics department. It is based on an idea of Dr Myles Allen, who is trying to see whether it is possible to improve the accuracy of 'climate' predictions, as opposed to 'weather' predictions.

Splitting the models into smaller sections is not feasible, due to the nature of the calculations. All of this is described in the FAQ and the Climate Science sections, to the left of here.
Basically, the Earth's atmosphere is split into 'cubes', called cells, and various values are calculated for each cell for a nominal half hour. Then these values are used as the starting point for calculating what might happen in the next half hour. This repeats for a 15-year period.
This part of the documentation explains it better: http://www.climateprediction.net/science/model-intro.php
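
As a rough, invented illustration of that structure (the grid sizes, the 360-day model calendar, and all names here are assumptions, not CPDN's actual code):

    /* Toy sketch of a grid-point climate model's main loop. Each
       half-hour step depends on the previous one, so the time loop
       is inherently serial. */
    #include <stdio.h>

    #define NLON 96              /* cells east-west (invented size) */
    #define NLAT 72              /* cells north-south               */
    #define NLEV 19              /* vertical levels                 */

    static double state[NLEV][NLAT][NLON];  /* e.g. a temperature field */

    static void step_half_hour(void)
    {
        /* Update every cell from the previous state. */
        for (int k = 0; k < NLEV; k++)
            for (int j = 0; j < NLAT; j++)
                for (int i = 0; i < NLON; i++)
                    state[k][j][i] += 0.0;   /* stand-in for the physics */
    }

    int main(void)
    {
        long steps = 15L * 360 * 48;  /* 15 model years of half-hours */
        for (long t = 0; t < steps; t++)
            step_half_hour();
        printf("completed %ld timesteps\n", steps);
        return 0;
    }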

To split it into smaller parts, each run by different people, would mean a much larger transfer of data in both directions. And different processors return slightly different results, due to the way the maths libraries are implemented in different brands of processor, and possibly in different models of the same brand.
All of this was considered, and rejected, at the start, several years ago. It has also been discussed on the 'Message' board, formerly called the Community board.
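
That rounding sensitivity is easy to demonstrate. The tiny C program below prints two different values for the same sum on IEEE-754 doubles; differences of this kind compound over millions of timesteps:

    /* Floating-point addition is not associative; evaluation order
       (and hence compiler, instruction set, and maths library)
       affects the low-order bits of the result. */
    #include <stdio.h>

    int main(void)
    {
        double a = 0.1, b = 0.2, c = 0.3;
        printf("(a + b) + c = %.17g\n", (a + b) + c);
        printf("a + (b + c) = %.17g\n", a + (b + c));
        return 0;
    }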

GPUs work by calculating lots of different, unrelated values for display on various parts of a screen, and so are not really suited to this project. This, too, has been discussed on the Message board.

And I live in Australia, not the USA.

ID: 20101
old_user2467

Joined: 28 Aug 04
Posts: 90
Credit: 2,736,552
RAC: 0
Message 20103 - Posted: 10 Feb 2006, 17:33:46 UTC - in response to Message 16979.  

I hope optimisation for AMD processors happens soon! For seti@home, my Athlon X2 3800+ halved the processing time for a WU when I downloaded optimised code. Crippling AMD chips this way is really not the way to move forward!


If you are running sulphur models on your Athlon boxes, have a look at this thread on the phpBB Message board:
http://www.climateprediction.net/board/viewtopic.php?p=32954#32954

Ananas has patched the sulphur executables so that the SSE2 optimisations are available on AMD boxes too.
ID: 20103
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 20104 - Posted: 10 Feb 2006, 18:21:46 UTC

The model running on the graphics card was discussed here.

The latest Windows code IS optimized for both Intel and AMD. So is the latest Linux code, but it's currently unstable.

As for 64-bit, that may be coming sometime, but I'm not sure how BOINC handles handing out different app types based on different operating-system capabilities. It can distinguish Linux from Windows, but can it distinguish 64-bit from 32-bit capability across the various Linux kernels, or in Windows? Perhaps, but I don't know for sure.
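
For what it's worth, a client binary can at least report how it was built; the C sketch below is purely illustrative (it is not how BOINC does this, and a 32-bit client running on a 64-bit OS would still need to ask the operating system):

    /* Illustrative check of the bitness a program was compiled for. */
    #include <stdio.h>

    int main(void)
    {
        printf("pointer size: %u bits\n",
               (unsigned)(sizeof(void *) * 8));
    #if defined(__x86_64__) || defined(_M_X64)
        puts("built as a 64-bit binary");
    #else
        puts("built as a 32-bit (or non-x86-64) binary");
    #endif
        return 0;
    }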

Another potential optimization is compiling the app with the -parallel switch so that one model can run on a dual proc/dual core/hyperthreaded PC faster. Instead of those PCs possibly running two models at the same time, you could have the option of running one model faster. This was tested with the coupled spinup model, but was unstable. But the instability could have been due to other optimizations rather than the -parallel switch.
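
In sketch form, the idea looks like this, written with explicit OpenMP rather than the compiler's -parallel auto-parallelization (cell count and names are invented):

    /* One timestep's independent cell updates spread across the
       cores of a dual/HT machine. Compile with e.g. -fopenmp. */
    #include <stdio.h>
    #include <omp.h>

    #define NCELLS 100000

    int main(void)
    {
        static double cell[NCELLS];

        #pragma omp parallel for     /* split the loop across cores */
        for (int i = 0; i < NCELLS; i++)
            cell[i] += 1.0;          /* stand-in for the physics    */

        printf("ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }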
ID: 20104
Pooh Bear 27

Joined: 5 Feb 05
Posts: 465
Credit: 1,914,189
RAC: 0
Message 20105 - Posted: 10 Feb 2006, 19:00:14 UTC - in response to Message 20104.  
Last modified: 10 Feb 2006, 19:00:34 UTC

Another potential optimization is compiling the app with the -parallel switch so that one model can run on a dual proc/dual core/hyperthreaded PC faster. Instead of those PCs possibly running two models at the same time, you could have the option of running one model faster. This was tested with the coupled spinup model, but was unstable. But the instability could have been due to other optimizations rather than the -parallel switch.


This has me a bit intrigued, but confused about how it will work when you are running multiple projects. I have an HT dual processor; how would it handle that? Would each processor get a WU, or would it parallelize across all 4 processors (2 real, 2 virtual)?

Again, it confuses me how it would be handled when other projects are selected. Would it know to stop the other processes to run in parallel? Or would a new BOINC version have to be created first?

I really would love to see this across more projects. It might give better throughput on those pesky HT Xeon chips that have issues running 2 of the same type of WU on the main and HT threads of the processor.

This is a very interesting avenue to go down. I hope it becomes a reality.


ID: 20105
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 20107 - Posted: 10 Feb 2006, 19:21:28 UTC - in response to Message 20105.  

This has me a bit intrigued, but confused about how it will work when you are running multiple projects. I have an HT dual processor; how would it handle that? Would each processor get a WU, or would it parallelize across all 4 processors (2 real, 2 virtual)?

Again, it confuses me how it would be handled when other projects are selected. Would it know to stop the other processes to run in parallel? Or would a new BOINC version have to be created first?

I am unsure, but have wondered the same thing. BOINC would have to be smarter. I could see this working easily if you were running only one project, but multiple projects pose problems. When Carl tried it, it was with the idea of running the spinup beta in the fastest time possible. A multiple-project environment is much trickier.

At work, we run a very high-resolution computer model of the atmosphere out to 36 hours on Linux. Compiling for parallel processing works quite well on a dual-processor system, but hyperthreaded CPUs appear to gain nothing at all. That could be down to the design of the model, though, rather than HT being unable to benefit a parallel-processing computer model.
ID: 20107
