SIMD’ifying multi-platform math

Nowadays, with each platform (even mobiles) supporting some sort of SIMD registers and operations, it is crucial for every codebase/engine to fully utilize the underlying SIMD functionality. However, SIMD instruction sets, registers and types vary for each platform, and sometimes between compilers as well.

This blog post explains how a platform-agnostic SIMD implementation can be built, without having to port large portions of the codebase for each new supported platform.

Before we start discussing technical details, let us examine the problem at hand first:

  • Each platform supports a different instruction set, e.g. SSE2 on Windows, Altivec on Xbox360 and PS3.
  • Each platform’s instruction set offers different types to operate upon, e.g. __m128 on Windows, __vector4 on Xbox360.
  • SIMD type arguments need to be passed to functions differently (pass-by-value vs. pass-by-reference).
  • Some platforms support additional compiler hints, e.g. __declspec(passinreg) on Xbox360, in order to avoid load-hit-store (LHS) penalties when passing arguments to functions.
  • Some operations like negate(), abs(), etc. might map to a single instruction on one platform, but not on the other – and vice versa.
  • Many engine-defined types like vectors, quaternions, matrices, etc. should make use of the SIMD functionality offered by the platform.

Of course, all of this can be worked around by tailoring each and every math routine in an engine to the needs of the underlying platform. However, this solution naturally leads to a lot of duplicated code, and makes the job of porting the engine to a new platform harder, and more time-consuming.

The solution that is used in Molecule allows for best performance, no code duplication, and only needs one implementation file to be changed in order to port the math module to a new platform.

Dealing with different types and function arguments

The first thing we can do in order to abstract the underlying types used by the SIMD instruction set is to use an ordinary typedef, like in the following example:

#if ME_PLATFORM_WINDOWS
  typedef __m128 float_simd128_t;
#elif ME_PLATFORM_XBOX360
  typedef __vector4 float_simd128_t;
#elif ...
  // other platforms omitted
#endif

This means that each function dealing with SIMD functionality uses float_simd128_t internally, and only ever returns objects of type float_simd128_t.

Because SIMD types should be passed by value on some platforms, and by reference on others, similar typedefs can be defined for SIMD arguments:

#if ME_PLATFORM_WINDOWS
  // pass by value on x86
  typedef float_simd128_t float_simd128_arg_t;
#elif ME_PLATFORM_XBOX360
  // pass by value on Xbox360
  typedef float_simd128_t float_simd128_arg_t;
#elif ...
  // pass by reference on other platforms
  typedef const float_simd128_t& float_simd128_arg_t;
#endif

Again, this means that instead of directly using float_simd128_t as arguments to functions, we use float_simd128_arg_t instead, like in the following example:

float_simd128_t Add(float_simd128_arg_t a, float_simd128_arg_t b);
float_simd128_t Sub(float_simd128_arg_t a, float_simd128_arg_t b);
float_simd128_t Mul(float_simd128_arg_t a, float_simd128_arg_t b);

This nicely solves the problem of having to deal with different types and arguments, without introducing wrapper classes which often add unnecessary overhead.
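As a minimal sketch of how the two typedefs could fit together, the following assumes an x86 target with SSE2 and adds a purely hypothetical scalar fallback for everything else. The ME_PLATFORM_* switches are replaced by compiler-provided macros only so that the snippet compiles on its own; SKETCH_HAS_SSE2, the fallback struct and the ToArray helper are illustrative names, not part of Molecule.

```cpp
#include <array>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>
  #define SKETCH_HAS_SSE2 1
  // native 128-bit register type, passed by value
  typedef __m128 float_simd128_t;
  typedef float_simd128_t float_simd128_arg_t;
#else
  #define SKETCH_HAS_SSE2 0
  // hypothetical fallback: four plain floats, passed by const reference
  struct float_simd128_t { float v[4]; };
  typedef const float_simd128_t& float_simd128_arg_t;
#endif

inline float_simd128_t Set(float x, float y, float z, float w)
{
#if SKETCH_HAS_SSE2
  // _mm_set_ps takes its arguments from the highest lane to the lowest
  return _mm_set_ps(w, z, y, x);
#else
  float_simd128_t r = { { x, y, z, w } };
  return r;
#endif
}

inline float_simd128_t Add(float_simd128_arg_t a, float_simd128_arg_t b)
{
#if SKETCH_HAS_SSE2
  return _mm_add_ps(a, b);
#else
  float_simd128_t r;
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
  return r;
#endif
}

// helper for inspecting results (illustration only)
inline std::array<float, 4> ToArray(float_simd128_arg_t a)
{
  std::array<float, 4> out;
#if SKETCH_HAS_SSE2
  _mm_storeu_ps(out.data(), a);
#else
  for (int i = 0; i < 4; ++i) out[i] = a.v[i];
#endif
  return out;
}
```

Callers only ever see float_simd128_t and float_simd128_arg_t, so a call such as Add(Set(1, 2, 3, 4), Set(10, 20, 30, 40)) reads the same on both code paths.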

Different instruction sets

Having our typedefs ready, how would we actually implement e.g. a function which adds two vectors? Or a function adding two quaternions? One possible solution is the following:

float_simd128_t VectorAdd(float_simd128_arg_t a, float_simd128_arg_t b)
{
#if ME_PLATFORM_WINDOWS
  return _mm_add_ps(a, b);
#else
  // other implementations omitted
#endif
}

Urgh, not very nice. All the code is cluttered with #if/#endif preprocessor directives, and some of the code is duplicated across several translation units (adding two vectors is the same operation as adding two quaternions). Additionally, because the instruction sets differ, some of the operations might have to be emulated by using several instructions instead of one, which are then all spread across different places.

Like many programming problems, this one can also be solved by introducing another level of indirection – we need to build a low-level layer which abstracts the instruction set, and in turn use this layer when implementing vector, quaternion, matrix, etc. functionality. We should make sure that this extra layer doesn’t negatively affect performance, though.

Low-level SIMD layer

The low-level abstraction layer needs to accomplish two things:

  • Each instruction of the underlying instruction set must be made available.
  • Each instruction of other platforms’ instruction sets must be emulated.

The first point is a no-brainer, but the second point is crucial. If, for example, one platform offers a native dot()-instruction but another doesn’t, it needs to be emulated on all platforms not supporting it. Only then can all the other functions and classes like vectors, quaternions, etc. completely rely on the low-level layer in their implementation.
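To make the dot() case concrete, here is one possible way to emulate a 4-component dot product using only SSE/SSE2 intrinsics. It is a sketch that assumes an x86 target and that the result should be splatted to all four lanes; the actual implementation in Molecule may look different.

```cpp
#include <emmintrin.h>  // SSE2 (also pulls in the SSE intrinsics)

typedef __m128 float_simd128_t;
typedef float_simd128_t float_simd128_arg_t;

namespace simd
{
  /// Emulated 4-component dot product, result splatted to all 4 floats
  inline float_simd128_t Dot(float_simd128_arg_t a, float_simd128_arg_t b)
  {
    // component-wise products: (ax*bx, ay*by, az*bz, aw*bw)
    const __m128 p = _mm_mul_ps(a, b);
    // swap neighbouring lanes and add: (x+y, x+y, z+w, z+w)
    const __m128 s = _mm_add_ps(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)));
    // swap the two halves and add: every lane now holds x+y+z+w
    return _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
  }
}
```

On a target with SSE4.1, the same function could instead be built around the _mm_dp_ps intrinsic, without any change to the code calling simd::Dot.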

A simple example is the abs()-instruction, which is not natively supported by the SSE2 instruction set, but can be emulated by using a bit-trick instead:

float_simd128_t Abs(float_simd128_arg_t x)
{
  // _mm_andnot_ps(a, b) computes (~a) & b, so the mask has to be the first argument
  return _mm_andnot_ps(SIGN_MASK_X1Y1Z1W1, x);
}
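The sign mask itself is not shown here. A minimal sketch, assuming the mask simply has the sign bit set in all four lanes (the SignMask helper is a made-up name for illustration), could look like this. Note that _mm_andnot_ps(a, b) computes (~a) & b, so the mask is passed first.

```cpp
#include <emmintrin.h>

typedef __m128 float_simd128_t;
typedef float_simd128_t float_simd128_arg_t;

// assumed definition: only the sign bit (bit 31) is set in all four lanes,
// which is exactly the bit pattern of -0.0f
inline float_simd128_t SignMask()
{
  return _mm_set1_ps(-0.0f);
}

inline float_simd128_t Abs(float_simd128_arg_t x)
{
  // (~mask) & x clears the sign bit of each lane
  return _mm_andnot_ps(SignMask(), x);
}

inline float_simd128_t Negate(float_simd128_arg_t x)
{
  // flipping the sign bit negates each lane
  return _mm_xor_ps(x, SignMask());
}
```

The same mask also covers negate(), which is another operation without a dedicated SSE2 instruction.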

By abstracting all the low-level functions (intrinsics) of the instruction set this way, higher-level functions like vector math now only rely on this layer:

inline vector4_t VectorMul(vector4_arg_t a, vector4_arg_t b)
{
  return simd::Mul(a, b);
}

inline vector4_t VectorMul(vector4_arg_t a, float s)
{
  return simd::Mul(a, simd::Set(s));
}

inline vector4_t VectorNegate(vector4_arg_t a)
{
  return simd::Negate(a);
}

inline vector4_t VectorDot(vector4_arg_t a, vector4_arg_t b)
{
  return simd::Dot(a, b);
}

inline vector4_t VectorLengthSqr(vector4_arg_t a)
{
  return simd::Dot(a, a);
}

inline vector4_t VectorFastNormalize(vector4_arg_t a)
{
  return simd::Mul(a, simd::FastRecipSqrt(simd::Dot(a, a)));
}

inline vector4_t VectorLerp(vector4_arg_t a, vector4_arg_t b, vector4_arg_t t)
{
  return simd::Lerp(a, b, t);
}

// other functions omitted

The last thing left to make sure is that the additional low-level layer doesn’t add any unnecessary overhead.

To inline or not to inline

The way this is done in Molecule is very simple, and yet efficient. Each function in the low-level layer is declared using either ME_SIMD_NATIVE or ME_SIMD_NON_NATIVE, which are simple #defines:

#define ME_SIMD_NATIVE                    ME_INLINE
#define ME_SIMD_NON_NATIVE                inline

In order to get the best possible performance, functions which map directly to a single native function of the instruction set are declared with ME_INLINE, which forces the compiler to always inline them.

Other functions which internally use more than one native function of the instruction set let the compiler decide whether to inline or not by using the standard inline keyword.

In addition to avoiding unnecessary overhead, this also has the added benefit of clearly showing whether an instruction is emulated or not, without having to consult the implementation.
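The definition of ME_INLINE is not shown in this post. A plausible sketch, assuming the usual compiler-specific force-inline keywords, together with one native and one emulated function on SSE2:

```cpp
#include <emmintrin.h>

// assumed definition of ME_INLINE, not taken from Molecule itself
#if defined(_MSC_VER)
  #define ME_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
  #define ME_INLINE inline __attribute__((always_inline))
#else
  #define ME_INLINE inline
#endif

#define ME_SIMD_NATIVE      ME_INLINE
#define ME_SIMD_NON_NATIVE  inline

typedef __m128 float_simd128_t;
typedef float_simd128_t float_simd128_arg_t;

namespace simd
{
  /// Maps to a single instruction, hence always inlined
  ME_SIMD_NATIVE float_simd128_t Add(float_simd128_arg_t a, float_simd128_arg_t b)
  {
    return _mm_add_ps(a, b);
  }

  /// Needs several instructions on SSE2, so the compiler gets to decide
  ME_SIMD_NON_NATIVE float_simd128_t Lerp(float_simd128_arg_t a, float_simd128_arg_t b, float_simd128_arg_t t)
  {
    // a + (b - a) * t
    return _mm_add_ps(a, _mm_mul_ps(_mm_sub_ps(b, a), t));
  }
}
```

Whether the compiler really inlined a ME_SIMD_NON_NATIVE function is best verified by looking at the disassembly.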

Finally, the low-level layer now looks like this:

namespace simd
{
  /// Sets all 4 floats to a scalar
  ME_SIMD_NATIVE float_simd128_t Set(float s);

  /// Sets all 4 floats to scalar values
  ME_SIMD_NATIVE float_simd128_t Set(float x, float y, float z, float w);

  /// Adds two values, returns a+b
  ME_SIMD_NATIVE float_simd128_t Add(float_simd128_arg_t a, float_simd128_arg_t b);

  // ...

  /// Splats x to all 4 floats, returns a.xxxx
  ME_SIMD_NATIVE float_simd128_t SplatX(float_simd128_arg_t a);

  // ...

  /// Arbitrarily shuffles a value, use simd::shuffle constants as template arguments
  template <unsigned int T0, unsigned int T1, unsigned int T2, unsigned int T3>
  ME_SIMD_NATIVE float_simd128_t Shuffle(float_simd128_arg_t a);

  /// Interleaving methods
  ME_SIMD_NATIVE float_simd128_t InterleaveAxBxAyBy(float_simd128_arg_t a, float_simd128_arg_t b);
  ME_SIMD_NATIVE float_simd128_t InterleaveAzBzAwBw(float_simd128_arg_t a, float_simd128_arg_t b);

  // ...

  /// Negates x
  ME_SIMD_NATIVE float_simd128_t Negate(float_simd128_arg_t x);
}
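As an example of what sits behind such declarations, here is a hedged sketch of how the shuffle, splat and interleave functions might be implemented on SSE2. The simd::shuffle constants are assumed to be plain lane indices, and plain inline is used instead of the ME_SIMD_* macros to keep the snippet self-contained.

```cpp
#include <emmintrin.h>

typedef __m128 float_simd128_t;
typedef float_simd128_t float_simd128_arg_t;

namespace simd
{
  /// Assumed shape of the shuffle constants: lane indices 0..3
  namespace shuffle { enum { X = 0, Y = 1, Z = 2, W = 3 }; }

  /// Result lane i takes its value from lane Ti of a
  template <unsigned int T0, unsigned int T1, unsigned int T2, unsigned int T3>
  inline float_simd128_t Shuffle(float_simd128_arg_t a)
  {
    // _MM_SHUFFLE lists the lanes from highest to lowest
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(T3, T2, T1, T0));
  }

  /// Splats x to all 4 floats, returns a.xxxx
  inline float_simd128_t SplatX(float_simd128_arg_t a)
  {
    return Shuffle<shuffle::X, shuffle::X, shuffle::X, shuffle::X>(a);
  }

  /// Returns (a.x, b.x, a.y, b.y)
  inline float_simd128_t InterleaveAxBxAyBy(float_simd128_arg_t a, float_simd128_arg_t b)
  {
    return _mm_unpacklo_ps(a, b);
  }

  /// Returns (a.z, b.z, a.w, b.w)
  inline float_simd128_t InterleaveAzBzAwBw(float_simd128_arg_t a, float_simd128_arg_t b)
  {
    return _mm_unpackhi_ps(a, b);
  }
}
```

Passing the lanes as template arguments keeps them compile-time constants, which _mm_shuffle_ps requires for its immediate operand.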

Conclusion

With all of the math functionality like vector, quaternion, and matrix operations implemented in terms of the low-level layer, and different types and registers abstracted using simple typedefs, porting to a new platform takes only two simple steps:

  • Define typedefs for the instruction set types, and arguments to functions.
  • Implement all functions located in the simd-namespace.

And that’s it – every other module will automatically work, and benefit from using the SIMD functionality.

19 thoughts on “SIMD’ifying multi-platform math”

    • A valid question! I did not use eigen3 in particular because of the following:

      1) I need only a ~5% subset of what eigen offers.
      2) eigen adds lots of stuff already implemented (static assertions, templates, …)
      3) Compile-times. They get bigger during development anyway, I’d rather not use a library which relies so heavily on expression templates. Performance improvement comes from using SIMD and paying attention to memory access patterns, not by getting rid of a temporary on the stack.
      4) I need the code to work on other platforms (PS3 SPUs), as well as non-disclosed platforms.

      eigen is certainly useful if you do a lot of number crunching, and need any of their advanced algorithms out-of-the-box. And it’s useful as a reference.

  1. Cool stuff Stefan. I did a similar article over at #AltDev awhile back though I used some templates to pull it off. I’d be interested to hear your thoughts on the implementation if you get a chance to look through it.

    Also have an article in the pipe with a SIMD valarray, using SSE and AVX. If you’re interested I can send you a link to the draft.

  2. Pingback: Pass by reference | .mischief.mayhem.soap.

  3. I really like your solution. However, I wouldn’t introduce a level of indirection. I feel that the Vector interface (VectorAdd, VectorMul, etc…) or the simd interface is already enough, depending on how extensive you want to be. You can get rid of the #ifdefs in functions by providing a clean header for each platform.

    • Thanks!
      One of the reasons why I chose to implement a simd namespace, which in turn is being used by Vectors, Quaternions, Matrices, etc., was to avoid code duplication (VectorAdd is essentially the same operation as QuaternionAdd).

      I see two alternatives to using this level of indirection:
      1) Implement a header for each platform *for each type*, e.g. all vector operations (VectorAdd, VectorMul, …) live in their own header – which then has to be implemented for each platform. This gets rid of the indirection, but is more work when porting to other platforms or adding functionality.
      2) Implement a header which uses macros for each underlying SIMD operation, e.g. a macro for _mm_set_ps1, one for _mm_add_ps, etc. This also gets rid of the indirection, and you only need to implement one header for each platform.

      I don’t mind the extra indirection via the simd namespace because the compiler will inline those calls anyway – at least I can make sure that it does by checking the disassembly.
      I get your point, but at the moment I’m not really a fan of either 1) or 2). Any alternatives?

      • >> One of the reasons why I chose to implement a simd namespace, which in turn
        >> is being used by Vectors, Quaternions, Matrices, etc., was to avoid code
        >> duplication

        Very good point.

        However, your vector/quaternion implementation could take advantage of specialised instructions depending on the platform (and then may need to become platform-specific). The dot product is the first example that comes to mind, but there are other issues like matrix multiplication (which can’t translate into a single SIMD instruction). If you are really careful about performance, you might end up having defines in your vector/quaternion class.

        I would add that if performance is your main concern, then you should leave aside generic implementations and start offering methods like batchMultiply, batchNormalize, etc… and make them platform-specific. Aside from instruction sets, platforms differ a lot in their pipelines, latency, register sets, memory bandwidth and tons of technical details. You simply can’t take advantage of that with a generic implementation.

        Granted, this is bad on a lot of points: readability, maintainability, consistency, etc… So yeah, you really need a good reason to go that way.

      • With the extra indirection via the simd namespace, vector/quaternion implementations are already taking advantage of specialised instructions. As you said, the dot-product is a perfect example of this: Dot() in the simd-namespace translates to several instructions on SSE2, but one instruction on SSE4 (and other architectures). Free functions like VectorAdd, VectorMul, etc. allow you to write high-performance algorithms without having to touch platform-specific instructions – that’s what the simd namespace is for.

        As an example, the real-time radiosity system only uses VectorAdd, VectorMul, and the likes, and no classes or similar. The disassembly boils down to the same code as if I had written all the intrinsics by hand. Of course I made sure that everything can be accessed linearly in memory amongst other optimisations, but that’s not the point.

        Maybe it didn’t come across from reading the post, but my current setup is the following:
        low-level: namespace simd: contains free functions which are force-inlined if they map to a single native instruction, and inlined otherwise (e.g. Dot() is force-inlined on SSE4, but inlined on SSE2)
        mid level: namespace math: contains free functions like VectorAdd, VectorMul, QuaternionAdd, etc. All of these functions use functions from the simd namespace, nothing else. This avoids code duplication while allowing for high-performance.
        high-level: namespace math: also contains classes like Vector3f, Vector4f, Quaternion, etc. which can and should be used in non-performance critical things, mainly gameplay stuff or similar.

        So the only platform-specific things are the functions in the simd namespace, all others automatically make use of specialised instructions, even with the extra indirection (it’s only an inlined/force-inlined “function call”).

        And I certainly agree that you need specialized functions for working with larger amounts of data, e.g. transforming particles, skinning, etc.

  4. Nice information, just one question. You write that the Dot is force-inlined in SSE4 and inlined in SSE2. Does that mean that you have a setup similar to:

    #ifdef SSE4_SUPPORT
    namespace simd
    {
    ME_SIMD_NATIVE float Dot( float_simd128_arg_t a, float_simd128_arg_t b );
    // …
    }
    // Or just #include "SSE4_simd.h"
    #elif defined(SSE2_SUPPORT)
    namespace simd
    {
    ME_SIMD_NON_NATIVE float Dot( float_simd128_arg_t a, float_simd128_arg_t b );
    // …
    }
    // Or just #include "SSE2_simd.h"
    #elif …
    #endif

    Or can you avoid code-duplication for each architecture?

    Cheers,
    Markus

    • Thanks Markus!
      I prefer something like the following:

      // pick the inlining policy for Dot() once per instruction set
      #if defined(SSE4_SUPPORT)
      #define ME_SIMD_DOT_INLINE ME_SIMD_NATIVE
      #elif defined(SSE2_SUPPORT)
      #define ME_SIMD_DOT_INLINE ME_SIMD_NON_NATIVE
      #endif

      namespace simd
      {
      ME_SIMD_DOT_INLINE float Dot( float_simd128_arg_t a, float_simd128_arg_t b );
      }

      Works well as long as there aren’t that many instructions that change between instruction sets. But on Windows it can also become a mess, thus there’s always the option of using completely separate include files.

      • I’ve got another question regarding different SSE version support:

        SSE version support is decided at compile time, so an engine built with the SSE4_SUPPORT flag requires a CPU that supports SSE4 or above, and will not run on CPUs that only support SSE2, for example. And an engine built with the SSE2_SUPPORT flag will work on all CPUs with support for SSE2 and above, but may underperform on SSE4-class CPUs. Am I right?

        So, what do you do when releasing a game? Decide on a minimum spec (e.g. SSE2) and build the engine for SSE2, even if it underperforms on some, maybe a large number of, CPUs?

        Btw, thank you for your posts! I find them really enlightening and a really good reference to look to when building parts of my engine!

      • Thanks Marc, appreciate the feedback!

        The system I now have in place looks like the following: there are a bunch of #defines, one for each supported feature set, e.g.:

        #define ME_SIMD_MATH_SSE_2 1
        #define ME_SIMD_MATH_SSE_3 2
        #define ME_SIMD_MATH_SSSE_3 3
        #define ME_SIMD_MATH_SSE_4 4

        The macros are defined like this to make #if/#endif clauses easier. With one additional define (ME_ENABLE_SIMD_MATH in Molecule), the user can decide on the minimum feature set required by the application.
        That is, the user just has to define e.g.
        #define ME_ENABLE_SIMD_MATH ME_SIMD_MATH_SSE_4
        in order to enable SSE4 support. Like you pointed out, this means that the application will only run on CPUs with SSE4 support.

        Code-wise, this makes #if/#endif clauses for different feature sets trivial. As an example, having a function whose implementation differs between SSE2 and SSE4 can be written like this:

        #if ME_ENABLE_SIMD_MATH >= ME_SIMD_MATH_SSE_4
        // SSE 4 and higher
        #elif ME_ENABLE_SIMD_MATH >= ME_SIMD_MATH_SSE_2
        // SSE 2 and higher
        #endif

        This is all defined in the library, and is basically of no concern to the user, except for the definition of ME_ENABLE_SIMD_MATH.

        Regarding the release of a game, I mainly see two options:
        1) Decide on a minimum spec, and release your game with e.g. mandatory SSE2 support (which is quite reasonable, every CPU nowadays should support it).
        2) Build different executables for your game (one with SSE2, one with SSE4, one with AVX), and let a separate application launch the correct version depending on the CPU instruction set supported. Of course, this leads to more builds that have to be maintained as well as tested.

  5. Hi. Really like your blog. Learned a lot. Couple of questions. In previous replies you mentioned that you have a procedural interface for high-performance code and a class-based interface for non-performance-critical code, e.g. gameplay. Does this class interface also use SIMD instructions, or is the math implemented in a classical “scalar” way?
