Implementing a semi-automatic structure-of-arrays data container

In performance-sensitive applications like games it is crucial to access data in a cache-friendly manner. Especially when dealing with a large number of objects of the same type, e.g. individual components in an entity-component-architecture, we should make sure to read as little data as possible. However, simple arrays-of-structures are often not suited for this, with structures-of-arrays yielding better performance. But the latter are not natively supported by the C++ language.

Array-of-structures (AOS) vs. Structure-of-arrays (SOA)

Before we discuss the inner workings of a possible solution, we need to explain the problem at hand, which can best be illustrated with an example.
Consider the following component:

struct Component
{
  // component data here
  AnyType member0;
  AnyType member1;
  // ...
  bool m_isVisible;
};

For illustration purposes, let us assume that the Component struct has additional members not shown here, so that sizeof(Component) yields 64 bytes. If we want to store an array of such components, we can either use plain arrays, or an std::vector. Both store the data in a so-called array-of-structures (AOS), which means that all instances of Component will be stored next to each other in memory, like so:

components[0] -> [ member0, member1, ..., m_isVisible]
components[1] -> [ member0, member1, ..., m_isVisible]

What does this mean for data accesses? Imagine that we want to render all of those components in one go, but in order to reduce the number of draw-calls submitted to the rendering API, we want to check whether the component is really visible:

for (size_t i=0; i < count; ++i)
{
  if (components[i].m_isVisible)
  {
    // render component here
  }
}

Even though it doesn’t look like it in the C++ code, we are reading a lot of unnecessary data whenever a component is invisible. Every time we access the member components[i].m_isVisible, we pull in the whole cache line that maps to the corresponding address in memory. So even though a component might not be visible, we still access all its data, reading 64 bytes instead of just 1 byte!
If we need to touch all the components’ data because they are all visible anyway, then it doesn’t really matter. But in the average case, we read a lot of unnecessary data.

A simple solution is to store the data using a structure-of-arrays (SOA) approach:

struct Component
{
  // component data here
  AnyType* member0;
  AnyType* member1;
  // ...
  bool* m_isVisible;
};

Note the pointers: instead of storing an array of Components, we store arrays of individual members next to each other. The memory layout of the data then becomes:

member0 -> [ component0, component1, ..., componentN]
member1 -> [ component0, component1, ..., componentN]
m_isVisible -> [ component0, component1, ..., componentN]

The code doesn’t have to change much, we only need to move the array access [i]:

for (size_t i=0; i < count; ++i)
{
  if (components.m_isVisible[i])
  {
    // render component here
  }
}

This simple change ensures that we only read the data we need. When we check components.m_isVisible[i], we also pull in all the boolean values for the other 63 components on that cache line, but none of the other members as long as we don’t touch them in code.

However, in practice it’s not just a simple change like in the code shown above. The surrounding code needs to be changed as well. Memory needs to be allocated differently, members need to be accessed differently, etc.
Which brings us to the topic of this post: what we would like to have is some sort of dynamic array that stores all its elements in SOA-fashion internally, but offers an interface similar to plain arrays or std::vector.

A non-generic first try

To keep things simple enough for this post, we only want to concern ourselves with POD data, and not deal with placement new and manual destructor invocation. You need this in production code as soon as you want to store non-PODs in such a container, but we want to concentrate on the more crucial parts of the code in this post.

For a non-generic, non-templated first stab at an implementation, let us store the following struct in an SOA manner:

struct Item
{
  int m_int;
  float m_float;
};

The most interesting methods we need in our implementation are Add() and Remove(), which are simple in this case because we know the data types we’re dealing with:

class ArraySOA
{
public:
  void Add(const Item& item);
  void RemoveAt(size_t index);

private:
  int* m_data0;
  float* m_data1;

  size_t m_size;
};

void ArraySOA::Add(const Item& item)
{
  m_data0[m_size] = item.m_int;
  m_data1[m_size] = item.m_float;

  ++m_size;
}

void ArraySOA::RemoveAt(size_t index)
{
  const bool isLast = (index+1u == m_size);
  if (!isLast)
  {
    m_data0[index] = m_data0[m_size-1u];
    m_data1[index] = m_data1[m_size-1u];
  }

  --m_size;
}

As I said – POD only, simple implementation.

Adding an item is done by storing the item’s data in the corresponding arrays, and removing an item is done by swapping all members with the last one in the array. This changes the order of items, but keeps the array contiguous (this is different to std::vector).
Let us now increase the difficulty by changing the implementation to deal with structs having an arbitrary number of members. Up until now, we knew that we needed an array of ints and an array of floats. But what if the struct Item has 3 members? Or 4?

It quickly becomes apparent that we cannot store any number of arrays in the way we did. We need a more generic version, and an array of pointers-to-void for storing the start of each individual array fits the bill:

class ArraySOA
{
private:
  void* m_array[MEMBER_COUNT];
  size_t m_size;
};

As long as we only store the pointers to the individual arrays and cast them back to their original type, everything’s fine. Assuming we could simply do that in C++, the code for adding an item would then look like this (pseudo-code-ish):

void ArraySOA::Add(const Item& item)
{
  for (size_t i=0; i < MEMBER_COUNT; ++i)
  {
    OriginalType* data = cast_to_original_type(m_array[i]);
    *data = item.access_member_i;
  }

  ++m_size;
}

There are two fundamental problems:

  • How do we cast back to the original type of the array (e.g. int*, float*)?
  • How do we access the i-th member of an item without knowing the member’s name?

We can tackle both problems by using templates and template specializations, introducing our own type traits.

Retrieving the original type

In order to be able to retrieve the original type of the array, we can provide a class template that defines an inner type that corresponds to the original type of a member in the struct. The base template can be left empty, and each struct needs to add a specialization for each of its members. In the case of our simple struct Item, this means that we could do the following:

// empty base template
template <typename T, size_t N>
struct GetType;

// template specialization for Item, 0
template <>
struct GetType<Item, 0> { typedef int Type; };

// template specialization for Item, 1
template <>
struct GetType<Item, 1> { typedef float Type; };

This means that the expression GetType<Item, 0>::Type would yield the type int, and GetType<Item, 1>::Type would yield the type float. Before we can put this to use, we need to find a way to access individual members of the Item struct without knowing their name in the for-loop.

Pointers-to-members in C++

In addition to pointers-to-functions and pointers-to-member-functions, in C++ there is also a thing known as pointers-to-members. In simple terms, a pointer-to-member allows you to store a pointer to a data member of a class, and read/write values of any instance of that class through that particular pointer. Regarding the effect it achieves it’s somewhat similar to using offsetof() in C, but a pointer-to-member does this in a type-safe manner. Therefore, always keep in mind that a pointer-to-member is not simply an offset from the start of the class – it could very well be, but in the case of multiple and/or virtual inheritance, such a pointer carries more information (which is defined by the implementation). Hence, it shall never be treated as a simple integer, but needs to be stored and accessed with the proper type.

Using our GetType template introduced above, we can now add a function template that yields the proper pointer-to-member of the Item struct, again specialized for individual members:

template <typename T, size_t N>
struct GetPointerToMemberType
{
  typedef typename GetType<T, N>::Type T::*Type;
};

template <typename T, size_t N>
static typename GetPointerToMemberType<T, N>::Type GetPointerToMember(void);

template <>
static typename GetPointerToMemberType<Item, 0>::Type GetPointerToMember<Item, 0>(void)
{
  return &Item::m_int;
}

template <>
static typename GetPointerToMemberType<Item, 1>::Type GetPointerToMember<Item, 1>(void)
{
  return &Item::m_float;
}

This may seem confusing at first, but bear with me.

The class template GetPointerToMemberType is a helper that makes dealing with the awkward pointer-to-member syntax easier. It simply takes the type defined by the GetType template, and defines an inner type that is a pointer-to-member of that type, in the given class. This means that when GetType<Item, 0>::Type yields type int, GetPointerToMemberType<Item, 0>::Type will yield a pointer-to-int-member for the class Item. Similarly, template arguments <Item, 1> will yield a pointer-to-float-member type.

This in turn allows us to define a function template named GetPointerToMember that returns the proper pointer to an Item‘s member. Again, the function template is specialized for template arguments <Item, 0> and <Item, 1>.

Implementing the loop

Armed with those templates, we can now try to rewrite the for-loop we had in a generic way:

void ArraySOA::Add(const Item& item)
{
  for (size_t i=0; i < MEMBER_COUNT; ++i)
  {
    typedef typename GetType<Item, i>::Type DataType;
    DataType* data = static_cast<DataType*>(m_array[i]);
    data[m_size] = item.*(GetPointerToMember<Item, i>());
  }

  ++m_size;
}

However, we have run into a fundamental problem: we cannot use a runtime variable i as compile-time template argument! We have to come up with some sort of compile-time for-loop which allows us to turn a compile-time expression into any number of function calls, like so:

ForEach<MyFunction, 3>();

should call

MyFunction<0>();
MyFunction<1>();
MyFunction<2>();

Note how a call of the function template ForEach is able to expand into several calls of a given function template, with the counter of the for-loop being passed as non-type template argument. But how do we implement such a ForEach function template? For that, we need a tiny bit of template meta-programming, but nothing too complicated.

A compile-time for-loop

The first thing we need is a function template that accepts an index as template argument, and calls another function template with that index, e.g.:

template <size_t INDEX>
void ForEach(void)
{
  MyFunction();
  ForEach<INDEX+1>();
}

Starting at ForEach<0>(), this would call MyFunction<0>(), MyFunction<1>(), and so on. However, there is nothing that stops the „loop“ right now. For that, we need to add a second template argument that denotes the maximum loop counter:

template <size_t INDEX, size_t COUNT>
void ForEach(void)
{
  MyFunction();
  ForEach<INDEX+1, COUNT>();
}

We could now use a template specialization in order to terminate the loop when INDEX == COUNT, but that doesn’t really work in our case because we want to have different ForEach calls with different loop counters.
Another option would be to add a third template argument that denotes whether the loop should terminate or not, and specialize the template for that argument. However, partial template specializations are not allowed on function templates, so that doesn’t work either.
There is a third option, however: type-based dispatching. By introducing two overloads of the ForEach function template, we can choose the correct one by supplying an unused temporary argument to the function call.

template <int N>
struct IntToType {};

typedef IntToType<true> ForEachDoNotTerminateLoop;
typedef IntToType<false> ForEachTerminateLoop;

template <size_t INDEX, size_t COUNT>
void ForEach(ForEachDoNotTerminateLoop)
{
  MyFunction();
  ForEach<INDEX+1, COUNT>(IntToType<(INDEX+1 < COUNT)>());
}

template <size_t INDEX, size_t COUNT>
void ForEach(ForEachTerminateLoop)
{
}

For an explanation of how the IntToType template works, see the original post. The important thing to note is that IntToType<false> and IntToType<true> yield two different types. Hence, IntToType<(INDEX+1 < COUNT)>() will yield a temporary of either type, which forces the compiler to undergo overload resolution, choosing one of our two overloads, depending on whether the expression INDEX+1 < COUNT was true or false. The two typedefs only exist to make the code more readable and clear.

And that is basically our compile-time for-loop. All we need now is another function template that kicks of the ForEach process:

template <size_t COUNT>
void ForEachFunction(void)
{
  ForEach<0, COUNT>(IntToType<(0 < COUNT)>());
}

So by calling ForEachFunction<3>, we get calls to MyFunction<0>(), MyFunction<1>(), and MyFunction<2>(). Just what we wanted!

Of course, in production code we need the ForEach function template to be much more generic, so that we can call both free functions and member functions with an arbitrary number of arguments. Additionally, we want to be able to specify which function should be called.
All of that adds quite a bit of boilerplate code and several more overloads, but is left as an exercise for the reader. The next version of the Molecule Core Library SDK will contain a full-blown implementation if you want to take a look.

Back to the start

Finally, we are able to correctly implement our Add() and Remove() methods for our SOA-style array:

void ArraySOA::Add(const Item& item)
{
  ForEachClassMethod<AddItem, ARRAY_COUNT>(this, item);
  ++m_size;
}

void ArraySOA::Remove(size_t i)
{
  const bool isLast = (i+1u == m_size);
  if (!isLast)
  {
    ForEachClassMethod<RemoveItem, ARRAY_COUNT>(this, i);
  }
  --m_size;
}

template <size_t N>
void AddItem(const Item& item)
{
  typedef typename GetType<Item, N>::Type DataType;
  DataType* data = static_cast<DataType*>(m_array[i]);
  data[m_size] = item.*(GetPointerToMember<Item, N>());
}

template <size_t N>
void RemoveItem(size_t index)
{
  typedef typename GetType<Item, N>::Type DataType;
  DataType* data = static_cast<DataType*>(m_array[i]);
  data[index] = data[m_size-1u];
}

Note how the code in the original loop has been moved into a function template, which is invoked by the ForEachClassMethod calls.

For a final implementation, there are a two more things left to do:

  • The ArraySOA class should be able to handle any struct type T, not just Item.
  • The ArraySOA class should offer methods for accessing elements both in AOS-fashion and SOA-fashion.

In Molecule, I dealt with the latter feature by using nested structs inside all types T. As an example, this is what a point light component looks like:

struct PointLightComponent
{
  struct Aos
  {
    math::vector4_t m_positionRadius;
    math::vector4_t m_diffuseColor;
    math::vector4_t m_specularColor;
    math::vector4_t m_shadowColor;
  };

  struct Soa
  {
    size_t m_count;

    math::vector4_t* m_positionRadius;
    math::vector4_t* m_diffuseColor;
    math::vector4_t* m_specularColor;
    math::vector4_t* m_shadowColor;
  };
};

The nested struct Aos is used for adding instances to the array, which offers the following interface:

template <class T>
void ArraySOA<T>::Add(typename const T::Aos& item);

For retrieving elements from the array, the class offers the following methods:

template <class T>
typename T::Aos ArraySOA<T>::GetAos(size_t index);

template <class T>
typename T::Soa ArraySOA<T>::GetSoa(size_t index);

template <class T>
typename T::Soa ArraySOA<T>::GetSoa(void);

GetAos() is useful if you only need read-access to the values stored in the array. Because all members are stored by-value, the data in the array cannot be changed.

GetSoa() returns the data of one particular entry in the array, which is useful for write-access because the struct’s members are stored by-pointer. Additionally, GetSoa() can return the whole data streams for all members, which is mostly used by individual component systems for bulk updates.

Last but not least, all the SOA-traits template specializations introduced in the beginning of the post have to be added for the struct as well, which is achieved using simple macros:

ME_DEFINE_SOA_CLASS(PointLightComponent, 4);
ME_DEFINE_SOA_TYPE(PointLightComponent, 0, math::vector4_t, m_positionRadius);
ME_DEFINE_SOA_TYPE(PointLightComponent, 1, math::vector4_t, m_diffuseColor);
ME_DEFINE_SOA_TYPE(PointLightComponent, 2, math::vector4_t, m_specularColor);
ME_DEFINE_SOA_TYPE(PointLightComponent, 3, math::vector4_t, m_shadowColor);

The macros do nothing more than declare the template specializations introduced earlier.

Wrapping up

To recap, our ArraySOA needs/uses the following things to get the job done:

  • Our own type traits that identify types and pointers-to-members of all members in a struct T.
  • Pointers-to-members for type-safe access to individual members of any struct T.
  • A compile-time for-loop that is able to call function templates with a compile-time counter.
  • Nested structs Aos and Soa to differentiate between the different kinds of access.

Performance-wise, the disassembly of the generated code boils down to the exact same code as if it was written by hand. This specifically means that e.g. ForEachClassMethod<AddItem, 3>(this, item) does exactly 3 assignments to the corresponding array streams, and nothing more. Neither the for-loop nor the pointers-to-members have any overhead when compiler optimizations are turned on.

And that concludes our implementation of the semi-automatic SOA-style array. Because storing elements in SOA-fashion is not natively supported by the language, we have to add quite a bit of extra declarations to each struct before it can be used, but that’s still better than having to write all that code by hand. I really would like to have a fully automatic implementation, but I don’t believe that’s possible without compiler support.

Advertisements

22 thoughts on “Implementing a semi-automatic structure-of-arrays data container

  1. Very nice – I love all this template magic. I had a similar idea in development in an old project of mine. The extra advantage with such a system is that you can extend the ME_DEFINE_SOA_TYPE macro to provide some basic reflection facilities, to help with automatic serialisation, and assignment of default values and valid ranges for your tools/editor UI.

  2. Very nice post. I also started working on a similar concept.
    Have you thought about going a step further and disecting vectors and matrices into single components for better SIMD parallelization

    • No, I haven’t thought about doing that with SIMD types.

      The reason for that is that simply transposing/swizzling types on load/store operations causes overhead, so ideally the content pipeline should write the data into a different binary format anyway. I’d rather touch the code then. Especially with SIMD types you want to work on a ton of data, and AoS- vs. SoA-code tends to be written very differently for SIMD-related data transformations. Consider doing simple lighting calculations Aos-style vs. Soa-style: in the former case, you can do all calcuations per light source, in the latter case you have to work with 4 light sources at once (or 8, if you’re using AVX). That really complicates things.

      However, offering helper structs like VectorAos and VectorSoa can be helpful, when they internally work with 1 and 4 vectors, respectively. Something akin to the ::aos and ::soa namespace in the Sony Math library.

  3. Thank you very much for this post. I’m thinking of migrating some parts of my code base to SOA usage but haven’t gained the proper momentum yet 🙂

    Regarding you example with visibility: doesn’t it make actually sense to store visible and non-visible objects in different collections?

    BTW, There is a very nice talk “Pitfalls of Object Oriented Programming” which also shows practical benefits of SOA usage.

    • Thanks!

      Regarding you example with visibility: doesn’t it make actually sense to store visible and non-visible objects in different collections?

      Yup, makes perfect sense. You can keep them in the same collection/array if you swap the visible objects to the front, but still update all of them each frame for updating their visibility.

      BTW, There is a very nice talk “Pitfalls of Object Oriented Programming” which also shows practical benefits of SOA usage.

      Tony’s talk is excellent. It is one of the first things I show my students when teaching them the whole meaning of cache-friendly data structures :). The slides have lots of relevant details.

  4. Hi Stefan, great article – as always – thanks for sharing!
    I have a question: how does exactly this data structure relate to the IdArray you mentioned on previous articles?
    The idea spinning through my mind is that you use the IdArray to create/destroy/lookup components, but, the ‘real’ data of those components resides either on the component itself, or in the system which manages the component in a SoA fashion; or both. For example:

    struct MeshComponent
    {
    VertexBufferId m_vb;
    IndexBufferId m_ib;

    uint32_t m_count;
    Matrix4x4* m_local_pose;
    AABB* m_aabb;
    bool* m_visible;

    };

    In such a design, MeshComponent holds two handles to rendering-related data and pointers to contiguous arrays of members. Am I correct or there’s something I’m missing?
    If so, how do you handle the case when a component gets deleted and you have to keep the pointers inside the components pointing to the right data?
    Have you ever had the need to sort the SoA? How do you update the pointers inside the components is such a case?
    Hope not asking too much, thanks in advance!

    • Thanks taylor!

      In the current design, the “real” data is always held by the corresponding system in SoA fashion. A component itself has pointers into the contiguous SoA-arrays, but whenever you create, destroy or lookup components you only deal with Ids (e.g. a MeshComponentId). This Id is the one unique identifier for a particular component. Using this Id, the user can lookup the corresponding component (this uses the IdArray mentioned in previous articles), and gets back a component instance – however, this instance is valid for one frame only. It cannot and should not be cached/stored anywhere, this is what the Ids are for.

      When deleting a component, the IdArray automatically takes care of updating the internal data structures so that all existing Ids still refer to the correct objects. The SoA data uses the swap-with-last trick to keep all instances contiguous.

      I hope that answers all your questions.

      • Hi Stefan, thanks for the answer!

        However, after the deletion of a component, the IdArray takes care of updating internal ‘sparse-to-dense’/’dense-to-sparse’ tables, but, because of the SoA swapping trick, the pointers inside the component referring to the last SoA element are now pointing to bad data. How can you update those pointers?

        Thanks in advance

      • Short answer: you don’t.
        Long answer: individual component instances that are retrieved using a lookup into the IdArray are only valid for one frame. They should not be cached or stored, but rather looked up using the Id each frame.
        This is similar to how you cannot store an iterator into a vector if you delete elements from that vector, or how you cannot store an offset into an array if you delete an element from that array. You will always end up accessing the “wrong” data.
        Therefore, using components is generally done as follows:

        // component creation
        MeshComponentId id = world.CreateComponent(initialData);

        // component lookup
        MeshComponent comp = world.LookupComponent(id);
        *comp->member = ...;

        The only thing you keep/store is the Id, which is basically an opaque datatype that can be used to lookup the corresponding component. Accessing individual components from “outside” therefore is a bit slower than usual because of the added indirection in the IdArray, but systems will always access *all* component data in SoA fashion, which has huge benefits.

      • Hey 🙂

        You say it’s valid for one frame only, does this mean that you defer the destruction of the components until the end of the frame?
        Because if it’s valid for one frame I would expect that my components still point to the correct data even if I delete another component of the same type and the SoA “swap with last element and decrement size” is not going allow this, unless we defer destruction?

        Cheers for another amazing post!

      • Yes, that’s true. If retrieved components have to stay valid for one frame, destruction has to be deferred.
        Another alternative is to destruct components immediately, but then all accesses to any component must always be done using an Id-based lookup.

  5. Indeed very nice topic! Thanks for sharing this with us!
    But I still can’t wrap my head around on how you did implement your “ForEachClassMethod” template? How it’s possible to pass a templated member method like in your example “AddItem” to another template method and let this one parametrize “AddItem” without using a functor?
    Am I missing smth or do we need to implement “AddItem”, “RemoveItem” and so on as a functor like this:
    template
    struct AddItem {
    void operator()(typename T::Aos& aos) { … }
    };

    Thanks in advance for helping me out with this template weirdness! 🙂

    • Yes, that’s how you have to do it, because you cannot take a pointer to a function template in C++ (the function template is a set of functions, not a single function). You can define a simple wrapper struct with a member template operator() though, and this operator just calls the real function, e.g. something like this:

      struct Wrapper
      {
      template <size_t N>
      void operator()(void) const
      {
      YourFunction<N>();
      }
      };

      ForEachClassMethod accepts any type T which has to offer an operator(), and the SoA-array implementation has one wrapper for each function it uses. It’s a slight inconvenience, but it is tucked away in the container implementation anyway.

      • Ok, but if I do it this way…
        struct AddItemWrapper
        {
        AddItemWrapper(typename const T::Aos& item) : _item(item) {}

        template
        void operator()(void) const
        {
        AddItem(_item);
        }

        typename const T::Aos& _item;
        };

        template
        void AddItem(typename const T::Aos& item)
        {
        typedef typename T::GetType::Type DataType;
        DataType* data = static_cast(_array[N]);
        data[_size] = _item.*(GetPointerToMember());
        }

        …you will get an “illegal call of non-static member function” error. So I think this wrapper have to be a little bit smarter so that it can accept a pointer to a member function, or am I missing smth?

      • No, it’s not that complicated. You just have to call the member function on an instance inside the wrapper, so just pass an instance to the wrapper. It’s a regular method call after all.

  6. Hi Stefan,

    Great post! I’m curious to hear about what kind of performance improvement you see when using AOS vs SOA. I did a quick and dirty implementation in C# (much easier than C++ thanks to reflection) and saw about a 25% improvement when doing a for loop that does an if() statement on one of the fields for all instances. Are you seeing a larger improvement than that? I could see that being the case since C# has some other safety measures when indexing arrays

    Also, what are your thoughts on using this idea in GPU programming or column oriented DBs?

    Thanks
    Mike

    • Hi Mike,

      It really depends on how much data you read, and what you do with that data. Sometimes the performance improvement can be as “little” as 10%, but I’ve seen improvements of more than 200% in other code I worked with. That is mostly the case when you’re doing the classic loop:


      for each instance
      if (SomeCondition(m_memberData))
      Do();

      If accessing m_memberData pulls in a lot of data, the performance improvement can be enormous.

      GPU programming also benefits from cache-friendly data layouts. The less data that is read, the better. On a lower level, accessing data in a cache-friendly manner is always good, but I can’t comment on how that translates to DB-systems.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s