# Adventures in data-oriented design – Part 4: Skinning it to 11

Having finished the third part of this series about data ownership, we will turn our attention to performance optimizations and data layout again in this post. More specifically, we will detail how character skinning can be optimized with a few simple code and data changes.

The initial sample program we are going to optimize accomplishes the following:

• Skinning & rendering of 100 characters, each character is animated & skinned individually.
• Each character can play a different animation at a different animation frame, which means no reusing of shared data.
• Each vertex can be influenced by a maximum of 4 bones.
• No instanced rendering, no mesh LODs.

The character we are going to use has about 5k triangles, position & normal data, and one set of texture coordinates. This is what the output of the sample looks like:

100 individually animated characters

We are only interested in performance figures for skinning the characters, n.b. the time it takes the CPU to skin all of the 100 characters. We are not interested in rendering.
As can be seen from the screenshot above, each character has one of three different materials assigned, and plays back one of three animations (either crouch, walk or run), starting at randomly chosen animation frames so data cannot be reused across characters.

First implementation

In the initial version, skinning was implemented as follows:

• Each vertex is assumed to have between 1 and 4 bone influences.
• The destination vertex stream consists of a float3 for position data, a float3 for normal data, and a float2 for texture coordinate data.
• The skinning code runs single-threaded.
• The skinning code uses SSE-optimized math routines for all matrix and vector operations.

In C++-pseudo-code, the skinning code is basically the following:

```for each vertex i
{
// load indices and weights
uint16_t index0 = jointIndices[i*4 + 0];
uint16_t index1 = jointIndices[i*4 + 1];
uint16_t index2 = jointIndices[i*4 + 2];
uint16_t index3 = jointIndices[i*4 + 3];

float weight0 = jointWeights[i*4 + 0];
float weight1 = jointWeights[i*4 + 1];
float weight2 = jointWeights[i*4 + 2];
float weight3 = jointWeights[i*4 + 3];

// build weighted matrix according to matrix palette and bone weights
matrix4x4_t mat0 = MatrixMul(matrixPalette[index0], weight0);
matrix4x4_t mat1 = MatrixMul(matrixPalette[index1], weight1);
matrix4x4_t mat2 = MatrixMul(matrixPalette[index2], weight2);
matrix4x4_t mat3 = MatrixMul(matrixPalette[index3], weight3);

// skin position, normal, tangent, bitangent data
vector4_t position = srcVertexData[i].position;
...

vector4_t skinnedPosition = MatrixMul(weightedMatrix, position);
...

dstVertexData[i].position = skinnedPosition;
...

// copy unskinned data
dstVertexData[i].uvs = srcVertexData[i].uvs;
...
}
```

For those familiar with the concept of skinning it should be quite clear what the code is doing. It first builds the weighted matrix according to the bone weights, and using the matrix palette generated by the animation system. Position, normals, etc. are then skinned using the weighted matrix, and all left-over unskinned data such as texture coordinates, vertex colors, etc. is simply copied over.

With this implementation, skinning 100 characters took 4.15ms on my system (i7-2600K, 4x 3.40 GHz). But there is a lot of room for improvement, so let’s start with the simplest optimizations first.

Splitting streams

The first thing that can be noted in the above loop is the copying of unskinned data. Even though that data never changes, it has to be copied each and every time.

Why? Most of the time, we cannot lock/map the data without having to fill the vertex buffer with all of the data again, e.g. when using D3D11_MAP_WRITE_DISCARD. Even if there was a different flag in PC APIs, or if we had direct access to the GPU memory (like we do on consoles), writing partial data is a bad idea anyway. Most GPU resources reside in write-combined memory, where it’s an especially bad idea to leave holes when writing data. On consoles, only writing partial data into a dynamic buffer can be a real performance killer!

The way Molecule solves this problem is by splitting the vertex data streams for skinned components into two parts. One part contains all the data that needs to be skinned, the other part contains all unskinned data. The vertex data is simply split into two streams offline in the asset pipeline, and the engine runtime uses two instead of one vertex stream upon rendering.

This optimization completely eliminates all unnecessary copying of unskinned data, and brings the performance figures from 4.15ms down to 4.00ms in our case. The more unskinned data (vertex colors, a second set of Uvs, etc.) and the more vertices we have, the bigger the savings become.

SIMD-friendly data layout

As stated in the paragraph about the first version of the implementation, position data (as well as any other skinned data for that matter) is stored as a float3 in the destination vertex stream. Most SIMD ISAs such as Intel’s SSE instructions operate on 4 floats at a time, and only support reading/writing a whole float4 from/to memory. This means that in order to store 3 individual floats (x, y, and z) into memory, we first have to extract them from the SIMD datatype into distinct floats, and then write them into memory.

This is bad. All data should stay in the same CPU pipeline as long as possible. Scalar data should stay in the scalar pipeline, SIMD data should stay in the SIMD pipeline. Moving data from one pipeline to another is always going to hurt performance – on consoles, going from floats to SIMD datatypes (and vice versa!) always incurs a load-hit-store penalty, because the data has to travel via memory! This is also true for moving data between the integer and floating-point pipelines, but let’s not digress into the realm of PowerPC-based console architectures…

Fortunately, all of this can be solved easily by slightly increasing the size of our vertex buffers. Instead of expecting float3 skinned data, all skinned vertex buffers in Molecule expect the data to be float4. This allows us to write SSE-datatypes directly into vertex buffer memory. Furthermore, D3D11 guarantees buffer data to be 16-byte aligned, which means we can use aligned stores (_mm_store_ps, MOVAPS).

Additionally, by also changing our source data from float3 to float4 and properly aligning it on a 16-byte boundary, we can use aligned reads (_mm_load_ps, MOVAPS) as well.

This means we can load from memory into SIMD types directly, perform all operations on those datatypes, and write the results back to memory without ever leaving the SIMD pipeline. This optimization brings the performance figures from 4.00ms down to 3.53ms. Note that even though we have to read and write more data, performance still increased. This is a good example of choosing the right data layout for the job, which doesn’t always mean keeping data as small as possible! Admittedly, most of the time keeping the amount of data that is read or written as small as possible is a good idea, though.

Runtime-friendly data layout

Remember that data-oriented design is mostly about how the data is read, transformed, and written. One mantra is to keep homogeneous streams of data contiguous in memory, and only store as little data as possible. This means that for certain kinds of data (the prime example being particles) it is often beneficial to store the data in a SoA rather than AoS fashion.

In our example, all the data we need for skinning is already stored in a cache-friendly manner:

• All the joint indices and weights are stored contiguously. They are accessed sequentially, no jumping around in memory.
• The vertex data contains only the data that needs to be skinned, unskinned data has been kicked out already.
• Vertex data is read and written sequentially, no jumping around in memory.

Without any additional knowledge about the data itself, one could say that the memory layout is already as cache-friendly as possible.
But it doesn’t stop there! Even if the data is stored contiguously already, it makes sense to think about what the data contains, and if that insight allows us to optimize memory accesses even more.

Remember that we stated that „each vertex is assumed to have between 1 and 4 bone influences“ in the beginning of the post. This means that for those vertices having less than 4 bone influences, the corresponding weights will be zero, and we are doing a lot of unnecessary work when building the weighted matrix.

How can we get rid of the superfluous operations? Novice programmers will often try to get rid of additional work by branching on a conditional, not doing anything in case the condition doesn’t hold. In our case, we could simply branch on the different weights and check whether they actually contribute to the weighted matrix like so:

```if (weight0 != 0.0f)
{
weightedMatrix = MatrixAdd(weightedMatrix, MatrixMul(matrixPalette[index0], weight0));
}

if (weight1 != 0.0f)
{
weightedMatrix = MatrixAdd(weightedMatrix, MatrixMul(matrixPalette[index1], weight1));
}

if (weight2 != 0.0f)
{
weightedMatrix = MatrixAdd(weightedMatrix, MatrixMul(matrixPalette[index2], weight2));
}

if (weight3 != 0.0f)
{
weightedMatrix = MatrixAdd(weightedMatrix, MatrixMul(matrixPalette[index3], weight3));
}
```

The bad news is that by introducing those extra branches the code will most likely run slower than the original implementation! Hardware branch prediction is really good at detecting branches following a certain pattern, but not so much with more or less random branches. Furthermore, branch prediction penalties on in-order CPUs like current-gen PowerPC-based consoles cost about 24 cycles each, so you’re better off doing the math operations anyway.
If your skinned mesh has lots of vertices with 1 or 2 bone influences, the code might run faster, but we really want something that runs faster in every case, not only in edge cases.

One other option to consider is storing the number of bone influences for every vertex, and replace the code building the weighted matrix with a separate loop, similar to the following code:

```for each influence j
{
weightedMatrix = MatrixAdd(weightedMatrix, MatrixMul(matrixPalette[index_j], weight_j));
}
```

Note that we now need to store the number of influences per vertex, and we added the overhead of an inner loop for each vertex. Yes, loops in such spots do have measurable overhead. Additionally, we need to fetch more data from memory because of the per-vertex influence count. If your skinnded mesh has lots of vertices with 3 or 4 bone influences, the „optimized“ code will most likely run slower because of the additional overhead and memory fetches – so again, we have optimized for an edge case.

We really want the code to adapt perfectly to the input data, no matter what the data is.

Storing the number of bone influences per vertex was already heading into the right direction, but introduced too much overhead. But there is one simple thing that we can do: rather than store the influence-count per vertex, why not sort the data and store the number of vertices having a certain influence count instead? We know that we have either 1, 2, 3 or 4 influences, so all we need to store is 4 loop counts!

So instead of storing the vertices in any order, we rearrange them (along with the index buffer) so that the source data first contains all vertices having only 1 bone influence, then all vertices having 2 influences, and so on. In addition to that, we store the number of vertices having 1 influence, the number of vertices having 2 influences, and so on.
All of that can be conveniently done in the asset pipeline, so it incurs no extra runtime cost.

The good news is that this allows us to only do the absolute possible minimum of work for each vertex by writing specialized code for each path. Furthermore, this opens up new optimization possibilities because we now exactly know what the data contains! As an example, we don’t need to store the weights for bones having only one influence – it is 1.0f anyway. That saves memory and leads to less memory accesses.

In the end, the code now looks similar to the following:

```for numVertices1Bone:
{
uint16_t index0 = jointIndices[i];

vector4_t position = srcVertexData[i];
vector4_t skinnedPosition = MatrixMul(matrixPalette[index0],position);

dstVertexData[i] = skinnedPosition;
}

for numVertices2Bones:
{
uint16_t index0 = jointIndices[i*2 + 0];
uint16_t index1 = jointIndices[i*2 + 1];

float weight0 = jointWeights[i*2 + 0];
float weight1 = jointWeights[i*2 + 1];

matrix4x4_t mat0 = MatrixMul(matrixPalette[index0], weight0);
matrix4x4_t mat1 = MatrixMul(matrixPalette[index1], weight1);

matrix4x4_t weightedMatrix = MatrixAdd(mat0, mat1);

vector4_t position = srcVertexData[i];
vector4_t skinnedPosition = MatrixMul(weightedMatrix, position);
dstVertexData[i] = skinnedPosition;
}

// code for 3 and 4 influences omitted
```

Note how we only ever carry out the operations that are absolutely needed. No inner loops, no extra branches, just straightforward code.

As you may have guessed already, this optimization yields the biggest bang for the buck – it brings the performance figures from 3.53ms down to 2.00ms. Of course it depends on the number of influences each vertex has, but it is optimal in every case, not just edge cases. For the 5k test character, the percentages of influences per vertex are roughly distributed as follows:
30% of vertices have 4 bone influences, 25% of vertices have 3, 20% have 2, 25% have 1 bone influence only.

Final touches

As a final optimization, the code can be rewritten to make it more compiler-friendly, which brings the performance figures from 2.00ms down to 1.78ms. Because those optimizations are highly compiler-specific, I won’t go into detail here. MSVC seems to like tight code because that causes less register spilling on the stack. GCC seems to favor more verbose code (similar to the one written above) which clearly shows constant and non-aliased variables.

And last but not least, building a 64-bit executable gives an additional speedup of 15% simply because there are more SSE registers available, leading to less register spilling.

Conclusion

In the end, we managed to get a decent ~2.75x speedup by a few data transformations, some code changes, and building for 64-bit systems. Implementing all of that took about half a day, and was absolutely worth it. Making use of Molecule’s task system and going fully multi-threaded further increases performance, scaling almost perfectly (linearly) across all available cores. As seen above, the code consists of 4 straightforward loops only, which can be trivially parallelized.

As a final note, the data transformations are also beneficial on compute shader-based architectures, simply because there’s less data to read and less work to do.

## 10 thoughts on “Adventures in data-oriented design – Part 4: Skinning it to 11”

1. Hey 🙂

While I know the articles goal is to present the user a very concrete example of how DOD can speed up a program I’m just wondering if your engine uses only a CPU implementation (all platforms) or do you also support a GPU one (specific platforms)? Like vertex streamout using the graphics pipeline or write to a buffer using the compute pipeline for later consumption during shadow / motionblur / rendering etc which should be way more efficient. Again I know that articles goal was to present DOD but I still wonder 🙂

On consoles you have unified memory so accessing it on CPU should be no problems. Ofc on PC there is a penalty for travelling through PCI-E bus. On the other hand PC CPUs are rarely bottleneck so I’ll assume that you have GPU versions as well right? 🙂

Cheers

• And on the new consoles you have async compute which can run in parallel with the graphics where you can squeeze some extra computation 🙂

• Yep, that sounds useful. We do get that with Mantle as well, right?

• Thanks, that’s great to hear!

• For now, I only have a CPU implementation. A compute-based version for the GPU will be available later, but that’s not crucial yet.

Speaking of unified memory, the lack of manual memory management is what really sucks at the moment with the available PC APIs. Even though the GPU is of course faster than the CPU, a multi-threaded skinning algorithm running on several cores already is quite fast, so in situations where a game is GPU-bound I would really like to do such things on the CPU. However, the best you can do right now is to use a DYNAMIC buffer to which the CPU writes (you at least get write-combined memory that way), and then copy to a DEFAULT buffer for rendering (otherwise, rendering speed will be really slow). Even for just a few skinned instances, this already takes milliseconds, which really is a shame.

I would so like to be in control of GPU memory, and the lack of that control is still the main reason why graphics programming on the PC feels like programming while wearing a jacket that’s three sizes too small. Trying to push *everything* into GPU-compute isn’t a real solution, either.

Rant over – that’s why I’m really looking forward to Mantle :).

• Yeah Mantle will be awesome and I hear your rant, I hope NVIDIA implements this as well. It’s not AMD specific. Though I am not involved in that project. I’ll see if I can get an answer regarding async compute though for PC 🙂

I wonder how well multicore CPU skinning would work on the new Gen4 consoles as it’s very interesting to free up even more GPU time like you say, especially if you are targetting 60 FPS and 1080p on the new consoles then 16 ms is not much to play with and I’d rather steal an ms or two from the CPU if I can do it efficiently 😛

Both the new consoles support AVX so that’s good.

Like running skinning + morph targets on the CPU at 1-2ms for 5-10 high poly characters on screen. What are your thoughts? Do you have any predictions or tests on the Gen4?

• Yes, I also think it’s sometimes important to free GPU resources by putting more on the CPU. I mean, for certain operations (like skinning) the GPU is of course massively faster than the CPU, so it could be beneficial running skinning on the GPU. On the other hand, if the GPU does almost everything, there’s not much that will still run on the CPU – I doubt that gameplay/scripting, AI, collision detection and physics peg a 8x-core CPU.
Then again, if you want to have 1000 uniquely skinned characters you have to do it on the GPU, otherwise it’s going to be too slow.

I don’t have any tests running on PS4/XB1 yet, but I think it’s going to vary from game to game. You can do a lot on eight cores, and the GPU will be busy with tons of graphics-related stuff at 1080p. On consoles you can at least write directly to write-combined memory using the CPU, which is faster than anything you get on PC (except Mantle probably).

• Oh and I just thought, using dual quaternions should speed this up even further since we deal with way less data. :))) Any experiments yet?

• No, no experiments yet, but on my longer-term todo-list. I’ll let you know once I’ve got some results to show :).

This site uses Akismet to reduce spam. Learn how your comment data is processed.