My time in the land of SIMD (Altivec vs SSE)

For the last few weeks I’ve been playing around with Single Instruction Multiple Data (SIMD) instruction sets. More specifically, I’ve been trying to do a very high-level, basic comparison of Altivec and SSE, as it seems both interesting and relevant considering Apple’s imminent move to Intel.

However, this seemingly simple comparison has a number of problems at the practical level. Firstly, there is no chip in production that supports both Altivec and SSE, nor does it appear as if there ever will be. This immediately rules out a direct comparison. Secondly, and this is obvious but still a necessary point, there is no single architecture (i.e. x86, PPC, etc.) that has both Altivec and SSE chips, making comparison even trickier.
Finally, the clincher: I’m an impoverished uni student, so any grand hopes of testing across many different setups were always going to be impossible.

The Plan
Write a simple program that serves very little purpose other than to stress the SIMD units of various chips. Repetition is the name of the game here, as SIMD units come into their own when the same operation is applied across many loop iterations. In this case a pi generator was chosen. All compiling would be done with gcc-4.0, and no direct ASM code was to be written.

The Hardware
Obviously I needed test beds capable of running Altivec and SSE. Altivec was covered with both my Mac Mini (G4 1.4 GHz) and the iMac (G5 1.8 GHz). SSE was taken care of by my trusty P4 2.8 (Prescott). There was, unfortunately, a significant clock speed difference, but it was unavoidable as I do not have access to faster PPC machines or a slower x86 (running Unix).

What does it do?
OK, so I wrote a little pi generator that uses basically the most inefficient method of calculating pi, the series 1-1/3+1/5-1/7 … ~= pi/4. Initially I wrote up a version that uses the CPU alone. No unrolling of loops, no ‘normal’ optimisations, just basic, raw CPU grunt. I then wrote up versions of this that do their work with Altivec and SSE packed vectors. These two are almost identical except for the intrinsic names (these vary between Altivec and SSE) and one other, probably important, detail: the SSE instruction set contains a hardware divide function, whereas Altivec relies upon a software implementation of this.
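To give a feel for what a software divide can involve, here is a minimal sketch in plain C. The AltiVec code itself is not shown in this post, so this is an assumption about the general approach rather than what my build actually did: AltiVec offers a reciprocal estimate intrinsic (vec_re), and a common trick is to sharpen such an estimate with a Newton-Raphson step and then multiply. The helper names below are hypothetical, and rough_reciprocal only imitates a low-precision hardware estimate so the sketch runs on any machine.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal estimate: compute 1/d,
 * then clear the low 12 mantissa bits so only about 12 bits are accurate. */
float rough_reciprocal(float d)
{
    float r = 1.0f / d;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);   /* safe way to view the float's bits */
    bits &= 0xFFFFF000u;              /* discard the low 12 mantissa bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step for the reciprocal: x1 = x0 * (2 - d * x0).
 * Each step roughly doubles the number of correct bits. */
float refine_reciprocal(float d, float x0)
{
    return x0 * (2.0f - d * x0);
}

/* Divide without a divide instruction: n / d ~= n * (refined estimate of 1/d). */
float soft_divide(float n, float d)
{
    return n * refine_reciprocal(d, rough_reciprocal(d));
}
```

The point is only that each divide turns into an estimate plus a few multiplies and subtracts, which is extra work per term compared with a single hardware divide.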
The Macs were both running OSX 10.4 (Tiger), whilst the P4 has been tested in Linux (Ubuntu, Hoary) and the hacked Intel version of OSX.
The pi generator itself performs a 128,000-iteration loop 1,000 times in order to complete this testing.
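The original source is not reproduced here, so the following is only a sketch of the kind of code described above, not the code that produced the timings. It assumes single-precision floats, four terms per vector, and the function names pi_scalar and pi_packed (all my own choices for illustration). The packed version uses SSE intrinsics from xmmintrin.h and quietly falls back to the scalar loop on machines without SSE.

```c
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, _mm_div_ps */
#endif

#define TERMS 128000     /* series terms per pass, as described above */

/* Scalar baseline: one term of 1 - 1/3 + 1/5 - 1/7 ... per iteration. */
float pi_scalar(void)
{
    float sum = 0.0f;
    float sign = 1.0f;
    for (int k = 0; k < TERMS; k++) {
        sum += sign / (2.0f * (float)k + 1.0f);
        sign = -sign;
    }
    return 4.0f * sum;
}

/* Packed version: four consecutive terms per iteration. */
float pi_packed(void)
{
#if defined(__SSE__)
    /* _mm_set_ps lists lanes from highest to lowest, so lane 0 holds 1. */
    __m128 denom = _mm_set_ps(7.0f, 5.0f, 3.0f, 1.0f);
    const __m128 sign = _mm_set_ps(-1.0f, 1.0f, -1.0f, 1.0f);
    const __m128 step = _mm_set1_ps(8.0f);   /* four terms ahead: +8 */
    __m128 acc = _mm_setzero_ps();

    for (int k = 0; k < TERMS; k += 4) {
        acc = _mm_add_ps(acc, _mm_div_ps(sign, denom));  /* hardware divide */
        denom = _mm_add_ps(denom, step);
    }

    /* Horizontal sum of the four lane totals. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return 4.0f * (lanes[0] + lanes[1] + lanes[2] + lanes[3]);
#else
    return pi_scalar();   /* no SSE available: fall back to the scalar loop */
#endif
}
```

Calling each function 1,000 times from a small main and timing the two runs would roughly mirror the test structure described above. Because the packed version adds the terms in a different order, its last few digits will not match the scalar result exactly.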

The results
OK, the results amazed, confused and annoyed me. Instead of getting the nice spread of results I was hoping for, I came out with one clear leader. The P4 blew the other two machines apart, both in Linux and OSX. By blew away, I mean the other two couldn’t even come close. The raw CPU time for the P4 was nearly twice as fast as the Altivec-enhanced time of the Mini. Of course I expected the P4 to be faster in raw CPU due to its higher clock speed, but I did NOT expect it to be faster than the Altivec version, at least not by such a significant margin.
Unfortunately the G5 decided to pack it in during testing. It gave results that were (considerably) slower than the Mini and then crapped itself. It’s back with Apple as I write this.
I will be posting ‘exact’ results tomorrow for this, though roughly this is how it broke down:
G4 1.4 PPC Raw CPU: 7 seconds
G4 1.4 PPC Altivec: 2.5 seconds
P4 2.8 Raw CPU: 1.5 seconds
P4 2.8 SSE: 0.8 seconds

There was practically no difference between the P4 in Linux and OSX (they use the same header files etc., so no real surprises). The difference between the Altivec and SSE times really amazed me though. The code is practically the same in each case (just changed for the platform) and yet the difference is more than would be expected from the clock speed difference alone.
It should also be noted that the code on the P4 was using SSE2, NOT SSE3.
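To put rough numbers on the clock speed point, here is a back-of-the-envelope sketch using only the approximate times listed above. Scaling seconds by clock speed is a crude normaliser (it ignores every other difference between the chips), and the constant and function names are just my own for illustration.

```c
#include <stdio.h>

/* Rough timings from the list above, in seconds. */
static const double G4_RAW = 7.0, G4_ALTIVEC = 2.5;   /* 1.4 GHz G4 */
static const double P4_RAW = 1.5, P4_SSE = 0.8;       /* 2.8 GHz P4 */
static const double G4_GHZ = 1.4, P4_GHZ = 2.8;

/* How many times faster the second time is than the first. */
double speedup(double before_s, double after_s)
{
    return before_s / after_s;
}

/* Billions of clock cycles spent: seconds multiplied by GHz. */
double gigacycles(double seconds, double clock_ghz)
{
    return seconds * clock_ghz;
}

void print_summary(void)
{
    printf("Altivec speed-up on the G4: %.2fx\n", speedup(G4_RAW, G4_ALTIVEC));
    printf("SSE speed-up on the P4:     %.2fx\n", speedup(P4_RAW, P4_SSE));
    printf("G4 Altivec: %.2f Gcycles\n", gigacycles(G4_ALTIVEC, G4_GHZ));
    printf("P4 SSE:     %.2f Gcycles\n", gigacycles(P4_SSE, P4_GHZ));
    printf("P4 SSE advantage per clock: %.2fx\n",
           gigacycles(G4_ALTIVEC, G4_GHZ) / gigacycles(P4_SSE, P4_GHZ));
}
```

With these rough figures, Altivec gives the G4 a 2.8x boost and SSE gives the P4 a little under 1.9x. Even after scaling for clock speed, the SSE run uses about 2.24 billion cycles against about 3.5 billion for Altivec, so the P4 still comes out roughly 1.5x ahead per clock.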

So in the hope of getting some results that are even slightly comparable, the next step is to underclock the P4 and try again. I also wish to try some older, slower x86 CPUs if I can get a hold of them.

OK, this whole thing was aimed at being a programming learning experience rather than a comparison. These are my major notes:
– Documentation for SSE is _terrible_! I’m sure it’s out there somewhere but I could find very little. Apple provide a small amount, but even that is more related to migrating code (Altivec -> SSE) and there is not much on the additional features of SSE (i.e. the divide intrinsic and double precision variables).
– Altivec is a ‘nicer’ interface than SSE. I’m sure this is due to Apple’s influence, but there is a lot less ‘ugliness’ about it compared to SSE.
– It’s quite easy to get a good improvement using both Altivec and SSE.
– I’ve really only scratched the surface of SIMD and it’s something I’d like to play with more down the line. Adding to my list of things to play with when I’ve finished this damn year.

5 thoughts on “My time in the land of SIMD (Altivec vs SSE)”

  • Nice one! What is the name of that series that is used to calculate pi/4? And can I get a copy of the code to run on the Athlon64?

  • Brian Willoughby says:

    Did you meet all of the requirements of AltiVec, such as memory alignment, etc? You should probably share your source code if you’re going to share the results, just so people can point out any mistakes that make the AltiVec look bad.

    SIMD is incredibly complicated. There are many easy ways to make a single mistake and lose every advantage. That’s just in your source code. There are also operating system concerns that you must be aware of, or your performance will drop considerably. I’m not convinced that your results mean anything, which is fine, since you say as much yourself, but I worry that this page will become a source of misinformation.

    One final comment: SIMD is best used for processing streams of data like audio or video. Your example of calculating pi from scratch is interesting, but it’s hardly what those execution units were designed for. In other words, saying that the P4 blew away the AltiVec at something neither is primarily designed to do is not terribly accurate.

    • Good points, Brian. Furthermore, GCC, especially the version used, was vastly more optimized for x86 than for all other architectures combined (more maintainers etc.), so to neglect direct SIMD code is to accept an even more inaccurate comparison.

      It could also be noted that, even so, for the purposes of the article, AltiVec brought up the processing speed by x3.5, as opposed to SSE2 bringing up the processing speed by only 2x.
