Stateless, layered, multi-threaded rendering – Part 4: Memory Management & Synchronization

Posted on February 13, 2015 by Stefan Reinalter

The last post of this series basically concluded with the following questions: how do we efficiently allocate memory for individual command packets in the case of multiple threads adding commands to the same bucket? How can we ensure good cache utilization throughout the whole process of storing and submitting command packets?

This is what we are going to tackle today. I want to show how bad allocation behavior for command packets can affect the performance of the whole multi-threaded rendering process, and what our alternatives are.

Previously in the series

This is the fourth post in a series of related blog posts. A complete list of the other posts includes:

Part 1: Basics
Part 2: Stateless API Design
Part 3: API Design Details
Part 4: Memory Management & Synchronization

Setup

First things first. The setup for the benchmark we want to perform is as follows:

All measurements are taken on an Intel Core i7-2600K, 3.40 GHz, 4-Core CPU. The CPU has 4 physical cores, but 8 logical ones because of enabled HyperThreading.
When run in a multi-threaded environment using Molecule’s task scheduler, there is one worker thread for each available logical core, which is 8 in our case.
One run of the benchmark submits 10.000 mesh components, and 10.000 light components. Each mesh component is added to two buckets: the GBuffer bucket, and the ShadowMap bucket, so 10.000 draw calls each. Each light component needs to be mapped in order to update some constants (resulting in 10.000 map operations), and appends a draw call to the Lighting bucket. All in all, this results in 40.000 AddCommand()/AppendCommand() operations.
Mesh and light components are already in-place, so no extra work like frustum culling or similar is being done. However, adding a command to a bucket generates some artificial extra work in order to more closely resemble a real workload.
Keys generated using GenerateKey() are distributed in a way that ensures that after sorting, the packets cannot be traversed in a purely contiguous fashion.
API submission does not talk to a rendering API such as D3D, but only calculates a hash from all the input arguments. This hash is checked to ensure that all implementations work correctly.
The measured time is the average of 1000 runs of the benchmark.
Caches are cleared in-between runs, by randomly accessing (reading & writing) 16MB worth of data unrelated to the benchmark.

Version 1

In order to get a feeling for how long the benchmark takes in a single-threaded scenario, we run it using the implementation described in the last post. The two things that we are interested in are the following:

How are individual packets allocated?

template <typename T>
CommandPacket Create(size_t auxMemorySize)
{
  return ::operator new(GetSize<T>(auxMemorySize));
}

In this case, we use global operator new for allocating memory for a packet.

How are commands added to the bucket?

template <typename U>
U* AddCommand(Key key, size_t auxMemorySize)
{
  CommandPacket packet = commandPacket::Create<U>(auxMemorySize);
  {
    const unsigned int current = m_current++;
    m_keys[current] = key;
    m_packets[current] = packet;
  }
  commandPacket::StoreNextCommandPacket(packet, nullptr);
  commandPacket::StoreBackendDispatchFunction(packet, U::DISPATCH_FUNCTION);

  return commandPacket::GetCommand<U>(packet);
}

Note that this implementation of AddCommand() is not thread-safe, which is OK in this case because we are only dealing with a single thread in this version.

Here are the timings for Version 1:

Adding commands took: 6.65ms.
Submitting commands took: 0.7ms.

Let’s see what we can do about that!

Version 2

The most obvious thing that is detrimental to our performance is allocating each packet using global operator new. First, there is no guarantee that individual packets are somewhat close to each other in memory, which hurts our cache-hit rate when submitting packets. Second, global operator new is slow – which is not the implementation’s fault, but rather ours. new, malloc, etc. were built as general purpose allocators, trying to do their best in the average case. However, we exactly know what we allocate, and what the lifetime of the allocation is: one frame. We can throw everything away after one frame.

Therefore, we replace the call to global operator new with a linear memory allocator, which is simply bumping a pointer in the single-threaded case:

void* alloc = m_current;
m_current += size;
return alloc;

We simply allocate a chunk of memory (say, 4MB) used by all our buckets beforehand, and use that for all allocations made by any bucket during a frame.

This gives us the following timings:

Adding commands took: 4.55ms.
Submitting commands took: 0.7ms.

Adding commands is now 1.46x as fast – nice! That was easy. Now, let’s throw some threading into the mix.

Version 3

In this multi-threaded scenario, we use Molecule’s task scheduler to distribute pieces of work to 8 different worker threads. For every 50 components being added to a bucket, a task is created. This means we’re creating 400 tasks (10.000 / 50 for mesh and light components, each) for the scheduler, let them run in parallel, and wait until they all have finished. This is done using simple parent/child relationships exposed by the scheduler API.

For allocating memory for a packet, we go back to using global operator new because it is already thread-safe.

For adding commands to a bucket, we make one simple change to the implementation:

template <typename U>
U* AddCommand(Key key, size_t auxMemorySize)
{
  CommandPacket packet = commandPacket::Create<U>(auxMemorySize);

  {
    const int32_t current = core::atomic::Increment(&m_current);
    m_keys[current] = key;
    m_packets[current] = packet;
  }

  commandPacket::StoreNextCommandPacket(packet, nullptr);
  commandPacket::StoreBackendDispatchFunction(packet, U::DISPATCH_FUNCTION);

  return commandPacket::GetCommand<U>(packet);
}

Instead of doing m_current++, we now use an atomic operation that increments the value of m_current atomically, returning its initial value before calling the function. On x86-64, this can be done by using either one of InterlockedIncrement, InterlockedExchangeAdd, or LOCK XADD in assembler code.

Sidenote: When working on platforms with a weakly-ordered memory model and/or CPU reordering such as the PowerPC cores found in Xbox360, PlayStation 3, and Wii U, you have to make sure that the atomic instruction is not reordered by the compiler by inserting the correct compiler barriers (or using Acquire/Release semantics). Additionally, you need to ensure that other cores can see the result of the atomic operation in memory by using appropriate memory barriers such as lwsync or sync.

With these changes in place, the timings are as follows:

Adding commands took: 1.85ms.
Submitting commands took: 0.7ms.

Going from single-threaded to multi-threaded submission using the same allocator brought us a 3.54x speedup, which is what was to be expected on a 4-core CPU. The task scheduler does have overhead, and its scalability could be better, but it’s nothing we need to worry about at the moment.

Again, let us try the linear allocator, but this time in a multi-threaded environment.

Version 4

Making the linear allocator thread-safe is easy to do, we just need to use an atomic operation for bumping the allocator’s pointer:

const int32_t currentOffset = core::atomic::Add(&m_offset, size);
return m_start + currentOffset;

Let’s see if we can also reap the 1.46x speedup in the multi-threaded case, by going from using global operator new to using a linear allocator:

Adding commands took: 1.72ms.
Submitting commands took: 0.7ms.

Hmm, this is only 1.07x as fast – what did we miss?

There are several things, actually. First of all, atomic operations aren’t as expensive as taking a lock, but they are not for free either. Second, there is now some amount of False sharing happening because all cores write to memory destinations residing on the same cache-line, at least some of the time. It is not as bad as it could be, because our allocation size is rather large compared to the cache-line size (20/36 bytes vs. 64 bytes, respectively). But still, it’s there.

Version 5

At the expense of using more memory, we can rule out false sharing by rounding the size of each allocation to a multiple of the cache-line size, ensuring that each thread/core completely owns whole cache-lines:

const int32_t currentOffset = core::atomic::Add(&m_offset, RoundUp(size, 64));
return m_start + currentOffset;

Now the timings are as follows:

Adding commands took: 1.65ms.
Submitting commands took: 0.9ms.

Adding commands is now faster, thanks to the elimination of False sharing. However, what we gained when adding commands is now lost when submitting them. The reason for that is that we now have to touch more memory than before, simple as that.

Version 6

We need to find a solution that avoids False sharing, but at the same time does not cause submitting the commands to touch more memory. And the solution for that is to use Thread-local storage (TLS) – or more specifically: thread-local linear allocators.

Because each thread gets its own linear allocator that isn’t touched by any other thread, we do not need to use atomic operations. Furthermore, because each allocator has its own chunk of memory, False sharing cannot happen. Of course, a disadvantage of this solution is that we need to allocate more memory beforehand, because we need a chunk for each of the thread-local allocators.

The implementation is simple, assuming your engine/codebase has support for TLS:

char* memory = tlsLinearMemory;
void* alloc = memory;
memory += size;

// write back
tlsLinearMemory = memory;
return alloc;

With this in place, the timings are:

Adding commands took: 1.45ms.
Submitting commands took: 0.7ms.

Not too bad. Submitting commands is fast again, and adding commands is 1.27x as fast as our original solution using global operator new.

There is still one thing left to do, though. There is still False sharing happening, although in a different place: when storing keys and packets in the arrays inside AddCommand() they are stored right next to each other, but they are probably being written to by completely different threads!

Version 7

This is the last version of the implementation, and gets rid of the False sharing happening inside AddCommand(). Again, we can use TLS to aid us, but it’s not as easy this time around.

The basic idea is as follows: instead of occupying a single entry when adding a new command, each thread that enters AddCommand() gets several entries (say, 32) that can be used for this and subsequent commands added by this thread. This means that when a thread first enters AddCommand() for a bucket, it gets 32 entries, and immediately uses one of them. The next 31 calls to AddCommand() done by this thread therefore don’t need any kind of synchronization or atomic operation, but can rather use the next of the 31 remaining entries. Once the number of remaining entries is zero, the thread gets assigned the next 32 entries.

Because each thread now writes to 32 adjacent entries in the array, False sharing is completely eliminated, due to the fact that each stored key occupies at least 2 byte, and the cache-line size is 64. Furthermore, we only need to do an atomic operation once every 32 calls to AddCommand() for each thread, per bucket.

The implementation is not as straightforward though, because you essentially need to have Thread-local storage per bucket, and not globally. In code, it would look something like this:

template <typename U>
U* AddCommand(Key key, size_t auxMemorySize)
{
  CommandPacket packet = commandPacket::Create<U>(auxMemorySize);

  // store key and pointer to the data
  {
    int32_t id = tlsThreadId;

    // fetch number of remaining packets in this layer for the thread with this id
    int32_t remaining = m_tlsRemaining[id];
    int32_t offset = m_tlsOffset[id];

    if (remaining == 0)
    {
      // no more storage in this block remaining, get new one
      offset = core::atomic::Add(&m_current, 32);
      remaining = 32;

      // write back
      m_tlsOffset[id] = offset;
    }

    int32_t current = offset + (32-remaining);
    m_keys[current] = key;
    m_packets[current] = packet;
    --remaining;

    // write back
    m_tlsRemaining[id] = remaining;
  }

  commandPacket::StoreNextCommandPacket(packet, nullptr);
  commandPacket::StoreBackendDispatchFunction(packet, U::DISPATCH_FUNCTION);

  return commandPacket::GetCommand<U>(packet);
}

In this implementation, we traded one atomic operation with a bit of False sharing per AddCommand() against one atomic operation every 32 calls, and 2-3 accesses to Thread-local storage per AddCommand().

What does this bring in terms of performance?

Adding commands took: 1.25ms.
Submitting commands took: 0.7ms.

Finally, with this last implementation, we have a speedup of 1.48x compared to the original implementation in the multi-threaded scenario. This means that going from global operator new to a linear allocator brought us the exact same speedup as in the single-threaded case – but only once we got rid of most synchronization points and False sharing!

Summary

ST: single-threaded
MT: multi-threaded
-----------------------------------------------------------
Version                             Adding       Submitting
-----------------------------------------------------------
Version 1
(ST, global new)                    6.65         0.7
-----------------------------------------------------------
Version 2
(ST, linear)                        4.55         0.7
-----------------------------------------------------------
Version 3
(MT, global new)                    1.85         0.7
-----------------------------------------------------------
Version 4
(MT, linear)                        1.72         0.7
-----------------------------------------------------------
Version 5
(MT, linear, size aligned to 64)    1.65         0.9
-----------------------------------------------------------
Version 6
(MT, thread-local linear)           1.45         0.7
-----------------------------------------------------------
Version 7
(MT, thread-local linear & bucket)  1.25         0.7
-----------------------------------------------------------

Speedup going from global new to linear (ST): 1.46x
Speedup going from global new to linear (MT): 1.48x
Speedup going from naïve single-threaded (Version 1) to multi-threaded (Version 7): 5.32x

Conclusion

What can we conclude by looking at the different implementations above? There are a few things that come to mind:

Be aware of how memory is allocated, and how it is accessed in the code.
Be aware of false sharing.
Atomic operations aren’t free.
Last but not least: Instead of trying to make synchronization points for shared data as short as possible, try to not use shared data in the first place. Not sharing mutable data is better than having any kind of synchronization. Double-buffering can help. Thread-local data can help.

Those are things I kept in mind when designing the API outlined in Part 3. I tried to make changing the allocation scheme as simple and non-intrusive as possible, because I knew that those things were going to play a role once we switch to a fully multi-threaded implementation.

25 thoughts on “Stateless, layered, multi-threaded rendering – Part 4: Memory Management & Synchronization”

Rémi Gillig (@remigillig) on February 13, 2015 at 6:07 pm said:

Very interesting and full of details, thanks !

Reply ↓
- Stefan Reinalter on February 15, 2015 at 12:50 am said:
  
  Thank you!
  
  Reply ↓
a0a on March 10, 2015 at 12:11 pm said:

A very nice series of posts! Thanks for sharing.

Reply ↓
youaresoomean on May 22, 2015 at 3:51 pm said:

Really enjoyed the series. One question, how to handle large resource/buffer updates in a multi-threaded way? Things like uploading texture or vertex/index buffers? Dependening on the API – which might or might not allow free-threaded resource handling – copy data into staging memory which is then uploaded from the render thread if no concurrent resource creation is allowed?

Reply ↓
- Stefan Reinalter on May 22, 2015 at 6:05 pm said:
  
  Your solution is one possible way to do it. Another would be to use double-buffering, if you can afford the memory for it.
  
  It also depends on the platform you’re working with. On some platforms (consoles), you have direct access to the GPU ringbuffer, and can directly put the data there, which is useful for certain things that can be rendered “immediate mode” style.
  
  Reply ↓
Lee (@EschersKazoo) on December 2, 2015 at 4:50 pm said:

Am I right in assuming that neither of ‘m_tlsRemaining’ and ‘m_tlsOffset’ are defined as thread-local, but rather that they’re normal arrays aligned to a 64 byte boundary and padded to 64 bytes per element to avoid false sharing? I think the ”m_tls’ prefix through me off for a second there.

Reply ↓
- Stefan Reinalter on December 2, 2015 at 5:17 pm said:
  
  Yes, you are right. I kept the “m_tls” prefix because they are accessed using a thread-local integer, and are thread-local in a sense that each thread gets its own copy of remaining and offset per bucket.
  
  Reply ↓
Dihara on August 29, 2016 at 8:23 am said:

Hi Stefan! Amazing blog! Has helped me understand so many things.
Anyways i’ve a few questions i want to ask.

1. In your RenderSystem, do you perform frustum culling on all StaticMeshComponents and SkinnedMeshComponents and submit Draw Commands to the Render Backend?

2. How are the components created? is it through the World? as in something like :
world->CreateComponent(entity);
Or are they created via the Systems?

These are just some of the questions off the top of my head. I’ve been reading your blog for a few months and it’s been really inspiring.

Reply ↓
- Stefan Reinalter on August 30, 2016 at 6:36 pm said:
  
  1. In your RenderSystem, do you perform frustum culling on all StaticMeshComponents and SkinnedMeshComponents and submit Draw Commands to the Render Backend?
  
  Yes, all components are culled by the respective system but not rendered yet. That is up to the renderer – it’s the one to decide how they should be rendered.
  
  2. How are the components created? is it through the World? as in something like :
  world->CreateComponent(entity);
  Or are they created via the Systems?
  
  In my current design, components are created by the system that manages them. There is a generic function for creating components that forwards/calls the correct function on a system using compile-time traits for the components & systems in question.
  
  Reply ↓
  - Dihara on September 1, 2016 at 7:02 pm said:
    
    Thank you! That cleared up a lot of things! One more question :
    Since the System’s create components i’m assuming they keep the components in some kind of array. If so, how do you implement system’s that depend on more than one type of Component, such as Transform and Physics? Keeping copies doesn’t sound very efficient to me since you’d have to copy them back if they were updated in a system that does not have ownership over that type of component.
  - Stefan Reinalter on September 6, 2016 at 10:40 pm said:
    It really depends on the access and usage patterns, but I try to keep such systems (and components) to an absolute minimum.
    There are two main possibilities:
    
    Cache a pointer to the component’s data. Every time the component changes its place in the system’s array, all “holders” need to be notified. Note that the component should only ever be used to read data from.
    
    Store a component’s ID and use that for accessing the component’s data.
    
    If you need to access the component’s data every frame, the first method will be faster. If you create/delete components a lot, the second method is probably faster.
chillonmathieu on November 1, 2016 at 11:01 pm said:

Hi Stefan !

In the previous article of this series you said that you would talk about the way to efficiently generate a key but in the end, you did not.
So i’m asking these questions: what does a key look like and how do we efficiently generate one?

By looking at the article by Christer Ericson that you mentionned in the first part, it would appear to me that using a bit field could do the trick. Unfortunately by using this design, it seems impossible to change dynamically the position of some informations within the key.
I could use some sort of “array of bits”, but now, how do I know where an information starts and where it ends? Carrying offsets with the key is overkill.

As for the key generation, it’s mainly the depth that bothers me. Christer Ericson stores it in 24 bits but a float is 32. Is there a way to “compress” a float into 24 bits? Without losing (too much) precision?

These are some of the question for which I have no answers yet and I’m stuck in my game engine development so I’d like to have some hints from you!
Thanks in advance.

Reply ↓
- Stefan Reinalter on November 4, 2016 at 1:35 pm said:
  
  In the previous article of this series you said that you would talk about the way to efficiently generate a key but in the end, you did not.
  So i’m asking these questions: what does a key look like and how do we efficiently generate one?
  
  Sorry about that, I never found the time to write another post on the subject. In the meantime, you can look at bgfx for an efficient implementation.
  
  By looking at the article by Christer Ericson that you mentionned in the first part, it would appear to me that using a bit field could do the trick. Unfortunately by using this design, it seems impossible to change dynamically the position of some informations within the key.
  I could use some sort of “array of bits”, but now, how do I know where an information starts and where it ends? Carrying offsets with the key is overkill.
  
  Bitfields are one option, but it’s usually faster to generate keys by manually shifting and masking the individual parts of the key into one integer. Don’t forget that you can have some kind of dynamism by putting different encode/decode mechanisms into their own functions, switching functions at run-time (e.g. via function pointers) if you need that. Or did you mean something else?
  
  As for the key generation, it’s mainly the depth that bothers me. Christer Ericson stores it in 24 bits but a float is 32. Is there a way to “compress” a float into 24 bits? Without losing (too much) precision?
  
  That’s actually pretty easy. If you take a look at the IEEE-754 format for single-precision floating-point numbers, you will notice that they are made up of 1 bit sign, 8 bits exponent and 23 bits mantissa. This means that you can convert a 32-bit float in that range into a 24-bit integer without losing any information at all – you don’t need the exponent.
  
  Hope that helps!
  
  Reply ↓
  - chillonmathieu on November 4, 2016 at 1:57 pm said:
    
    This helps a lot! Thank you for the answer!
Simon on April 25, 2017 at 8:29 am said:

Hi Stefan, For the purpose of Version 6 I’m wondering about the “how” you give each thread its own linear allocator. Do you actually use the linear allocator class from your memory management series (how to use it for TLS?), or is it something more simple along the lines of just having a thread_local array of a fixed size? What would a good size be for the latter?

Reply ↓
- Stefan Reinalter on April 25, 2017 at 1:00 pm said:
  
  Do you actually use the linear allocator class from your memory management series (how to use it for TLS?)
  
  Yes.
  
  or is it something more simple along the lines of just having a thread_local array of a fixed size?
  
  It’s something similar, linear allocators in thread-local storage. How much memory each of these allocators consumes heavily depends on the kind of project you’re doing, but you can always make the linear allocators growable, if you want. Then you don’t have to preallocate memory needed by the allocators, but can allocate new pages on-the-fly.
  
  Reply ↓
Simon on April 26, 2017 at 7:46 am said:

Thank you, Stefan! Your blog is fantastic! I want to try that out but have hit a bit of a snag with the compiler (VS 2013). Apparently it does not support non-POD objects (which my linear allocator is) as thread-local (using __declspec(thread)). C2483: ‘foo’ : object with constructor or destructor cannot be declared ‘thread’. So I’m considering either looking for a workaround or having to switch to a newer compiler (that supports thread_local).

Reply ↓
- Stefan Reinalter on April 28, 2017 at 2:45 pm said:
  
  You can always use platform-specific APIs like e.g. TlsAlloc & TlsFree, or switch to a compiler that supports C++11’s thread_local keyword.
  
  Reply ↓
yyc on June 1, 2017 at 5:24 am said:

very helpful article! Thanks for your sharing!

Reply ↓
Daniel on November 24, 2017 at 9:29 am said:

Thank you for sharing knowleadge with us!
I have one question about false sharing. When I use a task sheduler similar to your design, and each thread is locked on unique core (one thread per core), do I need to protect my code from false sharing? I read that false sharing occurs only when two or more threads are running on the same core.

Reply ↓
- Stefan Reinalter on November 24, 2017 at 3:00 pm said:
  
  I read that false sharing occurs only when two or more threads are running on the same core.
  
  No, unfortunately, that is wrong. It is the exact opposite.
  False sharing won’t hit you if threads run on the same core. But as soon as threads start running on several cores, the cores need to keep their (L1) caches synchronized, which is what causes the performance degradation in case of false sharing (two threads hitting distinct memory addresses that happen to share a cache line).
  
  Reply ↓
Foo on July 21, 2019 at 1:44 pm said:

This may be a stupid question, but why don’t you just create a command bucket per thread? You could merge those once all threads are done.

Reply ↓
- Stefan Reinalter on August 2, 2019 at 4:30 pm said:
  
  That would definitely be an option too.
  
  Reply ↓
Michal on March 19, 2020 at 2:03 pm said:

I have got an opengl book that describes same concept of command buffer, but without sorting and in its simples case it is almost as fast as your the base multithreaded case. Adding commands takes ~1.4ms and submitting ~0.7ms without any multithreading. Where the simplest example of yours on my computer is ~12.0ms adding and ~4.5/4.6ms submitting. I have also tried your code with linear allocator but it still cannot compete with code from the book.
To test this out I run program multiple times
40k drawing commands (only function simulation, aint no backend calls)
my CPU i5 7600k 4.5GHz

here’s the code https://pastebin.com/LFQMntKi

any suggestions?

Reply ↓
- Stefan Reinalter on March 19, 2020 at 2:56 pm said:
  
  You are comparing apples and oranges here.
  First, your approach with the AllPackets union allows you to allocate a contiguous array of packets at once. With my approach, each packet/command can have a different size and optional data which means that commands cannot be constructed up-front.
  Second, all your commands only need a uint32 at the moment. Once you have different commands, they are going to need different data. What this means in your approach is that the sizeof(AllPackets) will increase, no matter which packet type you store in them. So if your largest packet needs 64 bytes, you’re going to spend 64 bytes on *any* packet type, which is bad. If your most complex packet needs 64 byte, you’re going to load 64 bytes into the cache, no matter if you’re touching a BindShader or Draw packet, even though those only need 4 bytes. And you’re going to waste a lot of memory.
  Third, in my approach you can add completely new (game-specific) commands without having to touch the engine code. In your approach, somebody has to touch the engine’s AllPacket union and add new types, which can completely change the memory consumption and performance of *existing* code, due to point #2 above.
  
  Reply ↓