Adventures in data-oriented design – Part 2: Hierarchical data

One task that is pretty common in game development is to transform data according to some sort of hierarchical layout. Today, we want to take a look at probably the most well-known example of such a task: transforming joints according to a skeleton hierarchy.

Specifically, we will look at the difference between a simple, by-the-book OOP implementation and a more data-oriented design. We are going to cover the differences both in terms of implementation and performance. And once more, we will see that data-oriented design and object-oriented programming do not contradict each other!

Skeletal animation

Before we can start thinking of a possible implementation, we need to cover some ground first by quickly identifying the steps needed for skeletal animation.

Typically, an animated mesh such as a character consists of several joints that are connected with each other in a hierarchical fashion. The animation data specifies animation curves for each of the different components that make up the transformation of a joint, e.g. x-, y- and z-values for translation (and scaling, if supported), and x-, y-, z- and w-values for the rotation (stored as a quaternion).

Skeleton consisting of 96 joints

The animation curve data of each joint is stored in the joint’s local frame (= its local coordinate system), because that is easier to animate and leads to more natural-looking animations. This means that whenever you sample the animation curves, the resulting transformations are given in the local coordinate frame, and need to be transformed into the global coordinate frame in order to render the joints or the skinned character. Hence, sampling the animation curves gives us a transformation for each joint, called the local pose.

Note that if we want to blend between different joint transformations (typically done using LERP or SLERP), those blends are performed on local poses. After blending, layering, etc. has finished, all joints are transformed into their global pose. This is done using the joint hierarchy.
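
To make the blending step concrete, here is a minimal sketch, assuming a hypothetical LocalPose type holding a translation vector and a rotation quaternion, and math helpers that are not part of this post:

struct LocalPose
{
  math::vector4_t translation;
  math::quaternion_t rotation;
};

// blend two local poses joint by joint; t = 0 yields a, t = 1 yields b
void BlendLocalPoses(const LocalPose* a, const LocalPose* b, LocalPose* out,
                     unsigned int numJoints, float t)
{
  for (unsigned int i = 0; i < numJoints; ++i)
  {
    out[i].translation = math::VectorLerp(a[i].translation, b[i].translation, t);
    out[i].rotation = math::QuatSlerp(a[i].rotation, b[i].rotation, t);
  }
}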

To recap, these are the steps that need to be performed:

  1. Work out at what time we want to sample the animation curves. This is influenced by the speed of the animation, whether it’s looping, etc. The result of this is called the local pose.
  2. Optionally blend between different local poses to smooth the animation.
  3. Transform the local pose into the global pose according to the hierarchy.

I deliberately left out other steps such as ragdolls, steps that can affect the global pose and need a round-trip to the local pose (IK, FK), and others, in order to focus on the key parts of this post.
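
Put together, one animation update per character might look roughly like the following sketch (all names are illustrative, not actual engine code):

void UpdateCharacter(Character& character, float deltaTime)
{
  // step 1: advance playback (speed, looping, ...) and sample the curves,
  // which yields the local pose
  const float sampleTime = AdvanceTime(character, deltaTime);
  SampleAnimationCurves(character.clip, sampleTime, character.localPoses);

  // step 2 (optional): blend with another local pose, e.g. using the
  // BlendLocalPoses() sketch from above
  BlendLocalPoses(character.localPoses, character.targetPoses, character.localPoses,
                  character.numJoints, character.blendWeight);

  // step 3: transform the local pose into the global pose using the hierarchy
  character.skeleton->LocalToGlobalPose();
}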

Demo setup

The demo we will be using to measure the difference between different implementations simply performs the above steps for 1000 characters. Our timings will measure how long it takes to perform step 3 (traversing the hierarchy and transforming the joints’ poses) a thousand times. We will use the character consisting of 96 joints (shown above) for this demo.

An army of 1000 characters, all sampled at a different point in time

Initial implementation

A simple, naive implementation of a hierarchical, skeletal data structure could look like the following:

struct Joint
{
  math::matrix4x4_t globalPose;   // pose in model space, built from the hierarchy
  math::matrix4x4_t localPose;    // pose relative to the parent joint
  std::vector<Joint*> children;
};

Simple enough: each joint stores a local pose, a global pose, and an array of child joints. A skeleton then becomes an array of joints, and that’s it. In order to transform the skeleton starting from the root, a simple recursive function can be used:

class Skeleton
{
public:
  void LocalToGlobalPose(void)
  {
    // start at the root
    LocalToGlobalPose(&joints[0], math::MatrixIdentity());
  }

private:
  void LocalToGlobalPose(Joint* joint, math::matrix4x4_arg_t parentTransform)
  {
    joint->globalPose = math::MatrixMul(parentTransform, joint->localPose);

    // propagate the transformation to the children
    for (size_t i=0; i < joint->children.size(); ++i)
    {
      LocalToGlobalPose(joint->children[i], joint->globalPose);
    }
  }

  Joint* joints;    // array of joints; joints[0] is the root
};

The above traverses the hierarchy starting from the root, and propagates the transformation of a parent joint down to its children. This is done by first storing the resulting global pose of a joint, and then passing it along to its children. The code is short, and should be pretty self-explanatory.

So the question is, how long does it take to build the global poses for 1000 characters using the above code? The answer: 3.8ms. We cannot yet say whether that’s good or bad, but we can surely do better, otherwise I wouldn’t be posting about it.

A data-oriented approach

The first thing to observe here is simple, and should not come as a surprise to most of you: instead of storing the children of each joint, we can turn the problem on its head and store each joint’s parent instead. The std::vector<Joint*> children then becomes a single Joint* parent.

The next optimization is also simple: we store all joints of a skeleton in an array anyway, so why not store indices instead of pointers? Assuming no skeleton has more than 65536 joints (a safe assumption, I would say), we can store a uint16_t parent index instead of the pointer.
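
In its AoS form, the joint struct then simply becomes:

struct Joint
{
  math::matrix4x4_t globalPose;
  math::matrix4x4_t localPose;
  uint16_t parent;    // index of the parent joint in the skeleton's joint array
};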

The benefits?

  • Storing a uint16_t instead of a pointer needs less space, especially on 64-bit systems. We save memory, and have less data to access when building the global pose.
  • A complete skeleton is now trivially copyable, and can be moved around in memory using e.g. memcpy. Similarly, a whole skeleton can be loaded from disk with a single binary read, without having to worry about pointer fix-ups or similar. This alone is worth the data transformation.

The question that remains is: how do we change the code so that we still traverse the hierarchy in the proper order, and how do we get rid of the recursion?

In order to fix that, there is one key observation to make: as long as we always traverse the hierarchy in the same order, we can allocate and store the joints in that exact order, and walk through the array of joints linearly by flattening the hierarchy. Looking at the recursive code above, we can see that it traverses the hierarchy in what is called depth-first order.

Now imagine traversing the hierarchy, numbering the joints that are visited with a monotonically increasing number. Joint 0 would be the root, joint 1 the next one visited in the depth-first traversal, and so on. This means that if we iterate through the array of joints, a joint i can only have a joint j as its parent with j < i. This greatly simplifies the traversal code, and gets rid of the recursion.
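
Such a numbering can be generated offline with a single recursive pass over the pointer-based hierarchy from the initial implementation. A sketch (the helper is illustrative, not taken from the engine):

// records, for every joint visited in depth-first order, the index of its
// parent; for every non-root joint i, parentIndices[i] < i holds
void FlattenHierarchy(const Joint* joint, uint16_t parentIndex,
                      std::vector<uint16_t>& parentIndices)
{
  const uint16_t myIndex = static_cast<uint16_t>(parentIndices.size());
  parentIndices.push_back(parentIndex);

  for (size_t i = 0; i < joint->children.size(); ++i)
  {
    FlattenHierarchy(joint->children[i], myIndex, parentIndices);
  }
}

Calling FlattenHierarchy(&joints[0], 0u, parentIndices) yields exactly the flattened parent array used below; the root’s own entry is never read.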

One last thing left to do is change the layout from AoS (array-of-structures) to SoA (structure-of-arrays), leaving us with the following:

class Skeleton
{
private:
  uint16_t* hierarchy;                // parent index for each joint
  math::matrix4x4_t* localPoses;
  math::matrix4x4_t* globalPoses;
};

This change has the benefit that whenever we access e.g. the parent of a joint using hierarchy[i], the parent indices of several dozen subsequent joints will be read into the cache on the same cache line (with 64-byte cache lines, one line holds 32 uint16_t indices). The same is true for accessing localPoses[i].

So what does the transformation code look like? It’s become surprisingly simple:

void LocalToGlobalPose(void)
{
  // the root has no parent
  globalPoses[0] = localPoses[0];

  for (unsigned int i=1; i < NUM_JOINTS; ++i)
  {
    const uint16_t parentJoint = hierarchy[i];
    globalPoses[i] = math::MatrixMul(globalPoses[parentJoint], localPoses[i]);
  }
}

Beautiful, and it doesn’t get any simpler than this.

As stated above, this works because parentJoint < i, which means we will only access global poses which have been computed already. The recursive traversal that we had before is gone and is now implicitly given by the positions of joints in the array, which we numbered in a depth-first fashion.

What difference in performance does this make? With the same amount of transformations, this code now takes 3.1ms. Compared to the 3.8ms we had before, that’s an 18% reduction in execution time!

Conclusion

Not only did our few simple changes result in a nice performance increase, they also vastly simplified the code itself. And we didn’t have to abandon OOP principles; we didn’t even have to change the calling code. We still have our Skeleton class, and call LocalToGlobalPose() on it.

Concise, fast, and simple.

28 thoughts on “Adventures in data-oriented design – Part 2: Hierarchical data”

  1. Hi Stefan!

    Nice post again! Did you try storing the poses as a vector and a quaternion (assuming no scale, for simplicity)? I wonder how the quaternion walk compares to the matrix walk, converting and copying out the matrices in a separate sweep afterwards. This would also be very SIMD-friendly.

    In this context I remember a presentation by Crytek at last year’s SIGGRAPH, and they even sent quaternions to the shader, IIRC. So maybe the conversion would not even be necessary anymore.

    • Hi Dirk!

      The local poses are gathered from the animation curves as a vector (translation), quaternion (rotation), and vector (scaling). After LERPing between poses I convert them to matrices before building the global pose. I mostly did it that way because it keeps the code for constructing the global pose short, and the matrix multiplication is already SIMD-ified (using 4 dot-products). Furthermore, I don’t have to write different code paths for animations that need scaling, and for ones that don’t.
      Your question raises a valid point though, I too wonder which version would be faster. I didn’t dig into that yet, maybe for another post :).

      Regarding your second point, I can imagine sending quaternions to the GPU being faster, depending on the platform/architecture. At the moment I build the matrix palette on the CPU, and also do the skinning on the CPU (using double-buffered vertex buffers). This allows me to use skinned characters in different passes without having to pay the price for skinning on the GPU several times. Especially on the PS3 I preferred doing it on the SPUs rather than on the GPU. On the Xbox360 however, using the GPU and MEMEXPORT is faster.

      So for the time being, I do all of that on multiple threads on the CPU, and try to build custom solutions for particular hardware if they bring a significant performance gain.
      Maybe the compute units on next-gen platforms unify that…

    • I’ve looked into this, and the results are as follows:

      Version 1) Gather local poses into a 4×4 matrix for each joint, walk the hierarchy using matrix multiplications.
      Version 2) Gather local poses as a vector and a quaternion, walk the hierarchy using quaternion multiplications and vector adds, convert to 4×4 matrices before rendering.

      Walking the hierarchy using version 2 is about 50% faster than version 1 on my i7. To be fair though, we use a full 4×4 matrix where a 3×4 matrix would suffice, and version 2 does not support scaling, whereas the 4×4 matrix does (and even the 3×4 would).

      If we don’t need scaling on animations, using a vector and a quaternion is a clear winner over using matrices. If you need to support scaling as well, I guess version 2 would be roughly the same speed as using a 3×4 matrix in version 1. Both need to touch the same amount of data, and the instruction count is almost the same.
      As you pointed out, quaternions do have the advantage that you can send them to the shader directly, saving some CPU work and shader constants if you need to.
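
      For reference, a sketch of what the version 2 walk might look like (assuming a vector+quaternion pose type and math helpers analogous to the ones above; none of these names are from the actual code):

      struct PoseVQ
      {
        math::vector4_t translation;
        math::quaternion_t rotation;
      };

      void LocalToGlobalPoseVQ(void)
      {
        // the root has no parent
        globalPosesVQ[0] = localPosesVQ[0];

        for (unsigned int i = 1; i < NUM_JOINTS; ++i)
        {
          const PoseVQ& parent = globalPosesVQ[hierarchy[i]];

          // rotate the local translation into the parent's frame, then offset
          globalPosesVQ[i].translation = math::VectorAdd(parent.translation,
            math::QuatRotate(parent.rotation, localPosesVQ[i].translation));

          // concatenate the rotations
          globalPosesVQ[i].rotation = math::QuatMul(parent.rotation,
            localPosesVQ[i].rotation);
        }
      }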

      • Nice! Thanks for the update. Personally I think I would only support uniform scaling, and I guess this makes a lot of sense for games. I like to be able to store the transform in a decomposed state and clearly return position, rotation and scale. I would argue that adding a single float for uniform scaling and propagating it through the hierarchy would not affect your performance too much, right?

        I am also assuming the quaternion math is as SIMD’ized as the matrix math.

      • Nope, adding uniform scaling shouldn’t cost too much performance. Yes, vector/quaternion/matrix math all use SIMD instructions.
        And thanks for pointing out the fact that in the non-matrix version, you have access to individual components such as translation, rotation and scale. I can see where that is surely beneficial (procedural movement, e.g. blinking eyes, head-movement, etc.).

        I guess we have a clear winner then :).

      • You don’t seem to find it in writing anywhere, but quaternions support uniform scaling just fine: p’ = q p q* (simplified rotation form). Multiply q by a non-zero scalar ‘s’: (qs) p (qs)* = s² (q p q*) = s² p’. So scaling q by sqrt(s) applies a uniform scale of s, and these obviously compose naturally.

  2. Two quick performance tips that I didn’t point out in the post in order to keep it focused on the data layout:

    1) In practice, you want to allocate the hierarchy, the local poses and the global poses next to each other in memory. That can easily be done using either a linear allocator, or by allocating memory for the whole skeleton at once, letting the pointers point into different parts of the allocation.

    2) For maximum performance, you additionally might want to work with restricted pointers, either by restricting (e.g. __restrict on MSVC) the LocalToGlobalPose() method, or by restricting the globalPoses and localPoses pointers, giving the compiler further optimization options (see the sketch below). This could also be done in the first variant of the implementation, but the data-oriented approach makes it easier and clearer.
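
    A sketch of point 2, applied to the loop from the post (using MSVC’s __restrict spelling):

    void LocalToGlobalPose(void)
    {
      // promise the compiler that the three arrays never alias each other,
      // allowing it to keep values in registers across iterations
      const uint16_t* __restrict parents = hierarchy;
      const math::matrix4x4_t* __restrict local = localPoses;
      math::matrix4x4_t* __restrict global = globalPoses;

      global[0] = local[0];
      for (unsigned int i = 1; i < NUM_JOINTS; ++i)
      {
        global[i] = math::MatrixMul(global[parents[i]], local[i]);
      }
    }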

    • Would you create a new HeapArea per skeleton for a situation where you dynamically load and unload skeletons, or use some kind of HeapAllocator? Or maybe some totally different solution is in order?

      • Hmmm, I’m trying to think of scenarios where I would dynamically load/unload skeleton data… if they are streamed in, I would probably use a separate allocator for streamed data altogether, and do additional defragmentation of that heap each frame.

      • Well, you’d need to duplicate a skeleton every time you spawn a new instance of a unit (except maybe in situations where their animations are exactly in sync).

        So I’d say in any type of game where you can spawn units dynamically, you’d need to load/duplicate skeleton data on the fly.

        Thinking about it, a defragmenting allocator is probably a good solution.

      • No, not necessarily.
        What you have is 1) the hierarchical data and 2) the animation data.

        If you skin one character, you sample the animation data at a certain point in time, and create the scale/rotation/translation data for each bone. According to the hierarchical data, inverse bind pose, etc. you generate the global pose for each bone. Using that data, you skin the character.
        Now, if we have several characters, you have the same animation data, and the same hierarchical data (as long as the characters are part of the same rig). You only need temporary memory for the scale/rotation/translation data and/or the global pose of each bone. But that doesn’t mean you have to duplicate the skeleton, does it?

        If you do the skinning on the CPU, you need one buffer that holds the global pose for each bone, and just skin your characters one after another (in a single-threaded environment). Even when you have multiple threads (e.g. each of them skinning some N characters), each thread only has to hold one buffer for the global poses.

        You always need to have vertex buffers that hold the final, skinned data though. So yes, you could use a defragmenting allocator. Or maybe even a pool allocator, if you can limit the number of individually animated characters that are on the screen at the same time.
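
        In purely illustrative code (AnimationClip and MAX_JOINTS_PER_SKELETON are assumed to exist elsewhere), the sharing could be organized like this:

        // shared, immutable data: one per rig, used by all instances
        struct SharedSkeleton
        {
          uint16_t* hierarchy;
          math::matrix4x4_t* inverseBindPoses;
          uint16_t numJoints;
        };

        // per-character state: which clip is playing, and where in it we are
        struct AnimationInstance
        {
          const SharedSkeleton* skeleton;
          const AnimationClip* clip;   // shared animation curve data
          float localTime;             // per-instance playback position
        };

        // per thread: one scratch buffer for global poses, reused for each
        // character the thread skins
        math::matrix4x4_t scratchGlobalPoses[MAX_JOINTS_PER_SKELETON];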

    • Ooooh, that makes a lot of sense: you only store the current animation and the position within that animation per unit, and use the same skeleton to generate the pose data; after that, you load the generated data into a VBO or similar buffer.

      So really you only need a few different skeletons, if multiple animations and meshes use the same skeleton.

      Thanks!

    • This is already a somewhat old post, but one thing puzzled me while reading it: what kind of on-disk memory layout would you use, given that you mentioned a skeleton can be loaded from disk with one read, without fixing up pointers?

      Based on your post, the Skeleton struct is something like this:

      struct Skeleton
      {
        uint16_t numJoints;
        uint16_t* hierarchy;
        math::matrix4x4_t* localPoses;
        math::matrix4x4_t* globalPoses;
      };

      If you just write numJoints, the hierarchy values, localPoses and globalPoses, you can’t simply cast the read memory block to a Skeleton struct unless you have fixed array sizes (NUM_JOINTS in the sample code might indicate this?). The other option is to write numJoints, the pointers, and then the actual data blocks. But in this case you would need to handle the 32- vs. 64-bit cases somehow, and the pointers would still need to be fixed up after the struct is created.

      • Ah yes, good point.

        The important thing to realize is that by using a flattened hierarchy, you can read all of the arrays’ data in one go. Compare that to a hierarchical runtime structure which forces you to read data from disk, and then build a (pointer-based) tree structure from it. The latter is what most novices do, which leads to a lot of small, individual allocations (for the tree nodes), and fragments the heap.

        Once you deal with flattened arrays, it doesn’t make such a big difference how you load and setup your struct. It would be perfectly okay to allocate memory once, load all data into it from disk, and simply set the hierarchy, localPoses and globalPoses pointers to point into that memory. It would also be okay to do three single allocations (one for each array), and directly read from disk into the array’s memory. Or one could use a binary blob approach such as the one described here.

        Use whatever floats your boat; most of the time I choose to do things explicitly (do one allocation, set the pointers). Having the three pointers in the struct makes debugging easier, and you can always change it later on.
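
        As a sketch, “do one allocation, set the pointers” could look like this (File and allocator_t are stand-ins; numJoints is assumed to come from a small header read beforehand):

        Skeleton LoadSkeleton(File& file, allocator_t& allocator, uint16_t numJoints)
        {
          // pad the index array to 16 bytes so the matrices stay aligned
          const size_t hierarchySize =
            (sizeof(uint16_t) * numJoints + 15u) & ~static_cast<size_t>(15u);
          const size_t posesSize = sizeof(math::matrix4x4_t) * numJoints;

          // one allocation serves all three arrays
          char* memory = static_cast<char*>(allocator.Allocate(hierarchySize + posesSize * 2));

          Skeleton skeleton;
          skeleton.numJoints = numJoints;
          skeleton.hierarchy = reinterpret_cast<uint16_t*>(memory);
          skeleton.localPoses = reinterpret_cast<math::matrix4x4_t*>(memory + hierarchySize);
          skeleton.globalPoses = reinterpret_cast<math::matrix4x4_t*>(memory + hierarchySize + posesSize);

          // the file stores the hierarchy and the local poses back-to-back;
          // global poses are computed at runtime, so nothing is read for them
          file.Read(memory, hierarchySize + posesSize);
          return skeleton;
        }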

  3. Thanks Stefan. This is really interesting. I am learning about this at the moment, and your post was really helpful. I have one final question: in a real application, would you store the global poses in a persistent buffer, or would they just be stored on a per-frame basis and then uploaded to the GPU?

    • I think the answer to that is once again “it depends”.

      Note that the global poses themselves are normally not what you upload to the GPU, but rather the globalPose*inverseBindPose for each joint (which is called the matrix palette).

      I think I would set aside temporary memory for a palette of 512 matrices, and then do the following (assuming N characters are each rendered M times):

      1) Build global pose -> matrix palette into temporary memory
      2) Render all M instances of this character using this matrix palette (using instanced rendering if the API supports it)
      3) Repeat steps 1 and 2 for N characters

      This allows you to do cheap crowd rendering with only a minimal amount of temporary memory needed.

      I can also see scenarios where you might want to store the matrix palette for certain characters in a persistent buffer, e.g. for rendering them with 30Hz, but only updating them with 15Hz or similar (because you’re working with performance-limited platforms).
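
      A sketch of step 1, with illustrative names:

      // build the matrix palette (globalPose * inverseBindPose per joint)
      // into the temporary memory mentioned above
      void BuildMatrixPalette(const math::matrix4x4_t* globalPoses,
                              const math::matrix4x4_t* inverseBindPoses,
                              math::matrix4x4_t* palette, unsigned int numJoints)
      {
        for (unsigned int i = 0; i < numJoints; ++i)
        {
          // takes a vertex from bind-pose space to the joint's current global pose
          palette[i] = math::MatrixMul(globalPoses[i], inverseBindPoses[i]);
        }
      }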

      • The most common case for persistence, at least in my experience, is shadow rendering. Anyway, I was thinking that perhaps it might be best to flatten with a breadth-first traversal, as you might have several bones parented to the same one (such as fingers, toes and facial control bones), and they’d probably benefit from having the computed global parent matrix in the cache. I’ve used this flattening technique for years now and it works really well (even on a 60MHz Nintendo DS), excepting of course when you want to dynamically modify the bone hierarchy. Thanks for your interesting musings 🙂

      • Hi Mark,

        Oh, you’re absolutely right about persistence for shadow rendering! No matter if you use SPUs, MEMEXPORT or a compute shader, you’ll want to have the data lying around for subsequent rendering passes.

        Flattening with a breadth-first traversal also sounds like a good idea. I think most of the global parent matrices should be in the cache anyway (as long as you don’t have tons of bones or a very, very small D-cache), but you’re right that for the common kind of hierarchical data (= characters), it is probably better in general.

        Thanks Mark!

  4. Pingback: A faster quaternion-vector multiplication | Molecular Musings

  5. Great article!
    But I have a question. I quote:
    “One last thing left to do is change the layout from AoS (array-of-structures) to SoA (structure-of-arrays)… This change has the benefit that whenever we access e.g. the parent of a joint using hierarchy[i], the parent indices of several dozen subsequent joints will be read into the cache on the same cache line.”
    Wouldn’t we get the exact same effect if we just stored an array of structures instead? I’ve never heard anyone say to use SoA instead of AoS for performance.

    • Wouldn’t we get the exact same effect if we just stored an array of structures instead? I’ve never heard anyone say to use SoA instead of AoS for performance.

      I think you got SoA and AoS mixed up.
      Most of the time, SoA yields better cache performance as long as you touch only parts of your data structures, particle systems being the best and most commonly known example.
      Furthermore, as soon as you want to use SSE, AVX, VMX, and the like, SoA is often vastly better than AoS.

      • No, I didn’t get them mixed up.
        “Most of the time, SoA yields better cache performance as long as you touch only parts of your data structures”
        That’s exactly why you should only put the kind of data into your structures that you’re actually going to access. That’s why I wondered why you advised to use SoA.
        But why are SSE instructions and co. better for SoA? I mean, you need to look up the parent bone regardless of AoS or SoA.

      • That’s exactly why you should only put the kind of data into your structures that you’re actually going to access. That’s why I wondered why you advised to use SoA.

        First, I’m not the first to advise using SoA for improved cache performance and larger gains when using vectorized instructions; quite the contrary. Second, even if you only put the minimal amount of data into your data structures, it will hurt you if you don’t access all the data all the time.

        Let me illustrate my point with probably one of the most prominent examples: particle systems.

        If we were to put only the absolute minimum amount of data into our structure holding particle data, it would probably look something like the following:

        struct Particle
        {
          Vec3 position;
          Vec3 velocity;
          float lifetime;
          uint32_t color;
        };

        I deliberately left out other things you would likely need (angle, size, …) to keep the struct small.
        Now, if the particle system touches all the data of all the particles all the time in one monolithic function, you don’t need to read any further. But I highly doubt that this would be the case in a somewhat modern and dynamic particle system, with different emitters, spawn behaviour, special effects, maybe scripting, and so on.
        What you are more likely to have are different functions, which only touch the data relevant to them. For example, the integration step needs position and velocity, but isn’t interested in lifetime or color. Similarly, fading out the particles over time only touches the lifetime and the color.
        This means that you are going to pull in more data into your cache lines than you are actually going to need.

        If you instead use a SoA data structure for your particles, it would look like the following:


        struct ParticleSoA
        {
          Vec3* position;
          Vec3* velocity;
          float* lifetime;
          uint32_t* color;
        };

        Now whenever some function needs to work on one aspect of the particles, it only needs to pull in that data. We don’t clobber our cache with data which isn’t needed at that time.

        But why are SSE instructions and co. better for SoA?

        Because you cannot use vectorized instructions efficiently with an AoS layout of the data. Even if Vec3 uses SSE instructions internally, you only use 75% of the computing power available to you, because the fourth component of the vector is never used. It’s even worse with e.g. AVX, where only 3 out of 8 components are used. Using SSE for Vec4, Color4 and the like won’t help, even though it is probably what most people do.

        Instead, you could use the following:


        struct ParticleSoA
        {
          float* positionX;
          float* positionY;
          float* positionZ;
          float* velocityX;
          float* velocityY;
          float* velocityZ;
          float* lifetime;
          uint8_t* colorR;
          uint8_t* colorG;
          uint8_t* colorB;
          uint8_t* colorA;
        };

        Now you can conveniently work on 4, 8, or even 16 or 32 particles at the same time, using SSE and AVX at full 100% utilization.
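
        As a sketch (SSE intrinsics from <xmmintrin.h>; assuming the arrays are padded so that count is a multiple of 4), the integration step then becomes:

        #include <xmmintrin.h>

        void IntegratePositions(ParticleSoA& p, unsigned int count, float deltaTime)
        {
          const __m128 dt = _mm_set1_ps(deltaTime);
          for (unsigned int i = 0; i < count; i += 4)
          {
            // position += velocity * dt for 4 particles at once
            const __m128 x = _mm_loadu_ps(p.positionX + i);
            const __m128 vx = _mm_loadu_ps(p.velocityX + i);
            _mm_storeu_ps(p.positionX + i, _mm_add_ps(x, _mm_mul_ps(vx, dt)));
            // ... and the same for the Y and Z components
          }
        }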

        Also see Andreas Fredriksson’s slides from this year’s GDC for more details.

  6. Pingback: Creating an Optimized Transform Hierarchy – Alex Sabourin

  7. Sorry for the necro, but if I understood correctly, for each mesh associated with a particular skeleton, you’re overwriting the local/global transforms in the skeleton struct so that their transforms are contiguous with the hierarchical data. So how would this work with multi-threading?
