Stateless, layered, multi-threaded rendering – Part 3: API Design Details

Posted on December 16, 2014 by Stefan Reinalter

In the previous part of this series, I talked a bit about how to design the stateless rendering API, but left out a few details. This time, I’m going to cover those details as well as some questions that came up in the comments in the meantime, and even show parts of the current implementation.

Command buckets

In the first part of the series, I introduced the idea that not all layers need to store keys of the same size, e.g. a layer for rendering objects into a shadow map might only need a 16-bit key, while the G-Buffer layer requires 64-bit keys.

In Molecule parlance, the thing responsible for storing draw calls (and their associated key and data) is called a CommandBucket, which is a light-weight class template taking the key-type as a template parameter:

template <typename T>
class CommandBucket
{
  typedef T Key;
  ...
};

A renderer would create all the CommandBuckets it needs for rendering the scene, and e.g. store them as members, populating them with draw calls in its Render() method. Or you could make creation & destruction of the CommandBuckets completely data-driven and configurable.

But what does a CommandBucket need to store, and how do we store it?

N keys that are used for sorting the N draw calls
Data for N draw calls

Note that the amount and type of data stored for each draw call heavily depends on the type of draw call. Therefore, all data needed by a draw call is stored in a separate memory region, and the CommandBucket only stores a pointer to that region:

template <typename T>
class CommandBucket
{
  typedef T Key;
  ...

private:
  Key* m_keys;
  void** m_data;
};

Note that the keys and their data are stored in two separate arrays. The reason for that is that certain sorting algorithms don’t swap during the sort operation but rather populate a separate array with indices to the sorted entries, so they have to touch less data.
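To illustrate the index-based approach, here is a minimal sketch. It assumes std::stable_sort as the sorting algorithm, and the helper name SortedIndices is hypothetical, not part of Molecule. It only shows that neither the keys nor the data pointers have to be moved:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the indices of 'keys' in ascending key order.
// Neither the keys nor the per-command data are moved.
template <typename Key>
std::vector<uint32_t> SortedIndices(const std::vector<Key>& keys)
{
  std::vector<uint32_t> indices(keys.size());
  std::iota(indices.begin(), indices.end(), 0u);

  // stable_sort keeps the insertion order of commands that share a key
  std::stable_sort(indices.begin(), indices.end(),
    [&keys](uint32_t a, uint32_t b) { return keys[a] < keys[b]; });

  return indices;
}
```

With such an index array, submission would visit the data in the order m_data[indices[0]], m_data[indices[1]], and so on, instead of walking m_data front to back.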

When a CommandBucket is created, it takes care of allocating the arrays of keys and data pointers, and also stores all the render targets as well as the view matrix, projection matrix and viewport to be used when submitting the commands stored in that bucket. The rationale behind this is that it is very likely that you render into a certain layer using the same camera & viewport, so it makes no sense to specify that information in each and every draw call. Because the arrays are allocated up front, each CommandBucket can only hold a fixed maximum number of draw calls, which is specified when creating the bucket, as in the following example:

CommandBucket<uint64_t> gBufferBucket(2048, rt1, rt2, rt3, rt4, dst, viewMatrix, projMatrix);

Of course, the view and projection matrix would most likely be provided by a camera object, or some kind of camera system – but that is an entirely different topic.

Commands

Now that we have buckets for storing draw calls, how do we add commands to a bucket? And what exactly is a command?

A command is a self-contained piece of information that is understood by the render backend, and is stored in a command bucket. A command could identify any kind of draw call (non-indexed, indexed, instanced, …), or any other operation such as copying data into a constant buffer.

Each command is a simple POD that holds all the data needed by the backend in order to carry out the operation associated with a certain command. The following three structs are all examples of simple commands:

namespace commands
{
  struct Draw
  {
    uint32_t vertexCount;
    uint32_t startVertex;

    VertexLayoutHandle vertexLayoutHandle;
    VertexBufferHandle vertexBuffer;
    IndexBufferHandle indexBuffer;
  };

  struct DrawIndexed
  {
    uint32_t indexCount;
    uint32_t startIndex;
    uint32_t baseVertex;

    VertexLayoutHandle vertexLayoutHandle;
    VertexBufferHandle vertexBuffer;
    IndexBufferHandle indexBuffer;
  };

  struct CopyConstantBufferData
  {
    ConstantBufferHandle constantBuffer;
    void* data;
    uint32_t size;
  };
}

Because each command is a separate POD, putting them into a command bucket becomes simple. We can add a method that takes a key, makes space for storing the command, stores the pointer to the data in the internal array, and hands the POD instance to the user:

template <typename T>
template <typename U>
U* CommandBucket<T>::AddCommand(Key key)
{
  U* data = AllocateCommand<U>();

  // store key and pointer to the data
  AddKey(key);
  AddData(data);

  return data;
}

This is still really simple, but there are also a few bits we haven’t talked about yet, such as adding synchronization when accessing the arrays, and how to allocate memory for the command. We will revisit these topics later, because there are more important things we need to talk about first.

For now, assume that we have a command bucket which we populate in the following manner:

for (size_t i=0; i < meshComponents.size(); ++i)
{
  MeshComponent* mesh = &meshComponents[i];

  commands::DrawIndexed* dc = gBufferBucket.AddCommand<commands::DrawIndexed>(GenerateKey(mesh->aabb, mesh->material));
  dc->vertexLayoutHandle = mesh->vertexLayout;
  dc->vertexBuffer = mesh->vertexBuffer;
  dc->indexBuffer = mesh->indexBuffer;
  dc->indexCount = mesh->indexCount;
  dc->startIndex = 0u;
  dc->baseVertex = 0u;
}

Compared to the two alternatives presented in the last post, note that there is no need for calling another method after a draw call has been created and inserted into the bucket. After calling AddCommand(), the command completely belongs to you, and you simply fill all its members. That’s it. All store operations would write directly into one contiguous chunk of memory, without any additional copy operations – but more on that later.

The command bucket responsible for draw calls contributing to the G-Buffer now holds an indexed draw call for each mesh component. After all buckets have been populated, we can sort them by their keys:

gBufferBucket.Sort();
lightingBucket.Sort();
deferredBucket.Sort();
postProcessingBucket.Sort();
hudBucket.Sort();

For sorting the commands in the bucket, we can use whatever sorting algorithm we want. The one thing to note here is that each CommandBucket::Sort() can be run on a different thread, sorting all buckets in parallel.
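As a rough sketch of what sorting all buckets in parallel could look like, the following uses one std::thread per bucket. This is an assumption made for brevity; an engine would more likely hand the work to its own job system. KeyBucket and SortAll are hypothetical stand-ins, not Molecule types:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for a bucket: only the key array matters for this sketch.
struct KeyBucket
{
  std::vector<uint64_t> keys;
  void Sort() { std::sort(keys.begin(), keys.end()); }
};

// Sorts every bucket on its own thread and waits for all of them.
// The buckets share no data, so no locking is needed.
void SortAll(std::vector<KeyBucket*>& buckets)
{
  std::vector<std::thread> workers;
  workers.reserve(buckets.size());
  for (KeyBucket* bucket : buckets)
    workers.emplace_back([bucket] { bucket->Sort(); });

  for (std::thread& worker : workers)
    worker.join();
}
```

Joining all workers before returning guarantees that every bucket is fully sorted before submission starts.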

After all buckets have been sorted, we can submit them to the render backend:

gBufferBucket.Submit();
lightingBucket.Submit();
deferredBucket.Submit();
postProcessingBucket.Submit();
hudBucket.Submit();

The submission process has to be done from one thread because it constantly talks to the graphics API (D3D, OGL), submitting work to the GPU. It does not matter whether it is the main thread or a dedicated rendering thread, though.

Submission process

But how do we submit the commands to the graphics API? All we have is a key and a pointer to the associated data. This is clearly not enough, so we need some kind of additional identifier for each command.

One way of implementing this would be to add an identifier (e.g. an enum value) to each command, store it alongside the key and data, and then implement the Submit() method similar to the following piece of code:

void Submit(void)
{
  SetViewMatrix();
  SetProjectionMatrix();
  SetRenderTargets();

  for (unsigned int i=0; i < commandCount; ++i)
  {
    Key key = m_keys[i];
    void* data = m_data[i];
    uint16_t id = m_ids[i];

    // decode the key, and set shaders, textures, constants, etc. if the material has changed.
    DecodeKey();

    switch (id)
    {
      case commands::Draw::ID:
        // extract data for a Draw command, and call the backend
        break;

      case commands::DrawIndexed::ID:
        // extract data for a DrawIndexed command, and call the backend
        break;

      ...;
    }
  }
}

This would be a possible solution, but I would not recommend it. Why?

First, we kind of do the same thing twice. We once identify the command when storing it into the bucket (e.g. by storing U::ID into our array of IDs, m_ids), and then identify it again in the huge switch statement.

Second, the hardcoded switch statement makes it hard and tedious to add new commands, and impossible to add user-defined commands if we don’t have access to the source code.

There is a better and simpler solution: function pointers.

Backend dispatch

Instead of storing the ID of a command in the bucket, we can directly store a pointer to a function that knows how to deal with a certain command, and forwards it to the render backend. This is what is known as the Backend Dispatch in Molecule.

The backend dispatch is a namespace that consists of simple forwarding functions only:

namespace backendDispatch
{
  void Draw(const void* data)
  {
    const commands::Draw* realData = union_cast<const commands::Draw*>(data);
    backend::Draw(realData->vertexCount, realData->startVertex);
  }

  void DrawIndexed(const void* data)
  {
    const commands::DrawIndexed* realData = union_cast<const commands::DrawIndexed*>(data);
    backend::DrawIndexed(realData->indexCount, realData->startIndex, realData->baseVertex);
  }

  void CopyConstantBufferData(const void* data)
  {
    const commands::CopyConstantBufferData* realData = union_cast<const commands::CopyConstantBufferData*>(data);
    backend::CopyConstantBufferData(realData->constantBuffer, realData->data, realData->size);
  }
}

Each function in the backend dispatch has the same signature, hence we can use a typedef to store a pointer to any of those functions:

typedef void (*BackendDispatchFunction)(const void*);

The functions contained in the backend namespace are still the only ones that talk directly to the graphics API, e.g. by using the D3D device.
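To make the mechanism concrete, here is a small self-contained sketch of the same idea. The backend in it is a fake that merely records calls, DrawLog() and Dispatch() are hypothetical names, and a plain static_cast stands in for union_cast, whose implementation is not shown in this post:

```cpp
#include <cstdint>
#include <vector>

typedef void (*BackendDispatchFunction)(const void*);

namespace commands
{
  struct Draw { uint32_t vertexCount; uint32_t startVertex; };
}

// Fake backend for illustration: records the vertex counts it was asked to draw.
namespace backend
{
  inline std::vector<uint32_t>& DrawLog()
  {
    static std::vector<uint32_t> log;
    return log;
  }

  inline void Draw(uint32_t vertexCount, uint32_t /*startVertex*/)
  {
    DrawLog().push_back(vertexCount);
  }
}

namespace backendDispatch
{
  // Recovers the concrete command type and forwards its members to the backend.
  inline void Draw(const void* data)
  {
    const commands::Draw* realData = static_cast<const commands::Draw*>(data);
    backend::Draw(realData->vertexCount, realData->startVertex);
  }
}

// The submitting side only sees a function pointer and an untyped pointer.
inline void Dispatch(BackendDispatchFunction function, const void* command)
{
  function(command);
}
```

The submitting code never needs to know which command it is holding; the type information lives entirely in the function pointer that was stored next to it.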

So let’s go back and revisit the CommandBucket and its AddCommand() method. In addition to the command, we now also need to store a pointer to the dispatch function. Actually, we also need to store two more things we haven’t talked about yet in addition to the above:

The first is a pointer to any other command that needs to be submitted with this command and has the same key. If we store a pointer to another command we build an intrusive linked list that allows us to handle draw calls and commands that always need to be submitted in a certain order, no matter what key was assigned to them. This came up more than once as a question in the comments, and is needed when submitting draw calls where we e.g. first need to copy data to a constant buffer, and then submit the draw call. The intrusive linked list allows us to chain any number of commands.

The second is that certain commands need auxiliary memory to store intermediate data that is needed when submitting the draw call to the API at a later point in time. The perfect example for this is updating a constant buffer with a few bytes of data, such as lighting information. These bytes are tucked away in the auxiliary memory, and copied from there into the constant buffer when the command is submitted.

Command packets

Because we no longer just store single commands in the bucket, we introduce the concept of command packets. A bucket now stores command packets, and each packet holds the following data:

void* : a pointer to the next command packet (if any)
BackendDispatchFunction : a pointer to the function responsible for dispatching the call to the backend
T : the actual command
char[] : auxiliary memory needed by the command (optional)

Whenever the user wants to add a command of type T to the bucket, we need to make space for the other things as well. For that, I simply allocate raw memory which is large enough to hold all the data using appropriate sizeof() operators, and cast the individual parts to their desired type. In order for that to work, a few static_asserts ensure that all commands are POD structs.

Finally, a helper namespace takes care of doing all the offset calculations and casting:

typedef void* CommandPacket;

namespace commandPacket
{
  static const size_t OFFSET_NEXT_COMMAND_PACKET = 0u;
  static const size_t OFFSET_BACKEND_DISPATCH_FUNCTION = OFFSET_NEXT_COMMAND_PACKET + sizeof(CommandPacket);
  static const size_t OFFSET_COMMAND = OFFSET_BACKEND_DISPATCH_FUNCTION + sizeof(BackendDispatchFunction);

  // declared before Create() so that GetSize<T> is a known template at the point of use
  template <typename T>
  size_t GetSize(size_t auxMemorySize)
  {
    return OFFSET_COMMAND + sizeof(T) + auxMemorySize;
  }

  template <typename T>
  CommandPacket Create(size_t auxMemorySize)
  {
    return ::operator new(GetSize<T>(auxMemorySize));
  }

  CommandPacket* GetNextCommandPacket(CommandPacket packet)
  {
    return union_cast<CommandPacket*>(reinterpret_cast<char*>(packet) + OFFSET_NEXT_COMMAND_PACKET);
  }

  template <typename T>
  CommandPacket* GetNextCommandPacket(T* command)
  {
    return union_cast<CommandPacket*>(reinterpret_cast<char*>(command) - OFFSET_COMMAND + OFFSET_NEXT_COMMAND_PACKET);
  }

  BackendDispatchFunction* GetBackendDispatchFunction(CommandPacket packet)
  {
    return union_cast<BackendDispatchFunction*>(reinterpret_cast<char*>(packet) + OFFSET_BACKEND_DISPATCH_FUNCTION);
  }

  template <typename T>
  T* GetCommand(CommandPacket packet)
  {
    return union_cast<T*>(reinterpret_cast<char*>(packet) + OFFSET_COMMAND);
  }

  template <typename T>
  char* GetAuxiliaryMemory(T* command)
  {
    return reinterpret_cast<char*>(command) + sizeof(T);
  }

  void StoreNextCommandPacket(CommandPacket packet, CommandPacket nextPacket)
  {
    *commandPacket::GetNextCommandPacket(packet) = nextPacket;
  }

  template <typename T>
  void StoreNextCommandPacket(T* command, CommandPacket nextPacket)
  {
    *commandPacket::GetNextCommandPacket<T>(command) = nextPacket;
  }

  void StoreBackendDispatchFunction(CommandPacket packet, BackendDispatchFunction dispatchFunction)
  {
    *commandPacket::GetBackendDispatchFunction(packet) = dispatchFunction;
  }

  const CommandPacket LoadNextCommandPacket(const CommandPacket packet)
  {
    return *GetNextCommandPacket(packet);
  }

  const BackendDispatchFunction LoadBackendDispatchFunction(const CommandPacket packet)
  {
    return *GetBackendDispatchFunction(packet);
  }

  const void* LoadCommand(const CommandPacket packet)
  {
    return reinterpret_cast<char*>(packet) + OFFSET_COMMAND;
  }
}

Note that Create() uses the global operator new for allocating raw memory. In a real implementation, we would use our own linear allocator that ensures that all commands are stored in memory contiguously, which is much more cache-friendly when we need to iterate through the commands in the Submit() method.
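For illustration, a linear allocator can be as small as the following sketch. The class and its interface are assumptions made for this example, not Molecule’s allocator, and the sketch is deliberately not thread-safe:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump allocator: hands out consecutive slices of one buffer.
class LinearAllocator
{
public:
  explicit LinearAllocator(size_t capacity)
    : m_buffer(capacity), m_offset(0u) {}

  // Returns nullptr when the buffer is exhausted.
  // 'alignment' must be a power of two; offsets are aligned relative to the
  // buffer start, which the default allocator aligns for fundamental types.
  void* Allocate(size_t size, size_t alignment = alignof(std::max_align_t))
  {
    const size_t aligned = (m_offset + alignment - 1u) & ~(alignment - 1u);
    if (aligned + size > m_buffer.size())
      return nullptr;

    m_offset = aligned + size;
    return m_buffer.data() + aligned;
  }

  // Frees everything at once, e.g. after all buckets have been submitted.
  void Reset() { m_offset = 0u; }

private:
  std::vector<unsigned char> m_buffer;
  size_t m_offset;
};
```

Because consecutive allocations end up next to each other in memory, packets created one after another are also read one after another during submission, at least as long as the sort does not reorder them too much.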

Revisiting the command bucket

With command packets in place, the actual code for adding commands to a bucket becomes the following:

template <typename U>
U* AddCommand(Key key, size_t auxMemorySize)
{
  CommandPacket packet = commandPacket::Create<U>(auxMemorySize);

  // store key and pointer to the data
  {
    // TODO: add some kind of lock or atomic operation here
    const unsigned int current = m_current++;
    m_keys[current] = key;
    m_packets[current] = packet;
  }

  commandPacket::StoreNextCommandPacket(packet, nullptr);
  commandPacket::StoreBackendDispatchFunction(packet, U::DISPATCH_FUNCTION);

  return commandPacket::GetCommand<U>(packet);
}

Once we take care of the TODO marked above, we can also start adding commands from any number of threads. As a first implementation, we can simply add a critical section to make the code work, but obviously there are better solutions, which is something I would like to write about in one of the next posts in the series.
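One possible way to resolve the TODO without a critical section is to reserve the slot with an atomic increment. The following is only a sketch under that assumption; SlotArray is a hypothetical stand-in for the bucket’s key and packet arrays, and the packet allocation itself would still need its own synchronization:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Sketch of a bucket's key/packet arrays with lock-free slot reservation.
template <typename Key>
class SlotArray
{
public:
  explicit SlotArray(uint32_t capacity)
    : m_keys(capacity), m_packets(capacity), m_current(0u) {}

  // fetch_add hands every caller a unique index, so no two threads ever
  // write to the same slot. Returns false if the bucket is full.
  bool Add(Key key, void* packet)
  {
    const uint32_t current = m_current.fetch_add(1u, std::memory_order_relaxed);
    if (current >= m_keys.size())
      return false;

    m_keys[current] = key;
    m_packets[current] = packet;
    return true;
  }

  // Number of stored entries; only meaningful once all writers have finished.
  uint32_t Count() const
  {
    const uint32_t current = m_current.load(std::memory_order_relaxed);
    const uint32_t capacity = static_cast<uint32_t>(m_keys.size());
    return current < capacity ? current : capacity;
  }

  Key KeyAt(uint32_t index) const { return m_keys[index]; }

private:
  std::vector<Key> m_keys;
  std::vector<void*> m_packets;
  std::atomic<uint32_t> m_current;
};
```

Relaxed ordering is sufficient for handing out indices, but the threads adding commands still have to be synchronized with the thread that sorts and submits, e.g. by joining them or waiting on the jobs first.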

Of course, each command now also needs to store a pointer to the backend dispatch, exemplified for the draw command:

struct Draw
{
  static const BackendDispatchFunction DISPATCH_FUNCTION;

  uint32_t vertexCount;
  uint32_t startVertex;

  VertexLayoutHandle vertexLayoutHandle;
  VertexBufferHandle vertexBuffer;
  IndexBufferHandle indexBuffer;
};
static_assert(std::is_pod<Draw>::value == true, "Draw must be a POD.");

const BackendDispatchFunction Draw::DISPATCH_FUNCTION = &backendDispatch::Draw;

Custom commands

As stated earlier, using function pointers this way allows us to support user-defined commands as well. For example, you can make up your own commands by defining a POD, implement a completely custom dispatch function for it, and add that command to any bucket, or even chain it to other commands.
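As a hedged example of what such a user-defined command could look like, the following defines a made-up IncrementCounter command together with its dispatch function. Both names and the userCommands namespace are invented for this sketch; only the DISPATCH_FUNCTION convention is taken from the post:

```cpp
#include <cstdint>
#include <type_traits>

typedef void (*BackendDispatchFunction)(const void*);

namespace userCommands
{
  // Hypothetical user-defined command: adds 'amount' to a counter when submitted.
  struct IncrementCounter
  {
    static const BackendDispatchFunction DISPATCH_FUNCTION;

    uint32_t* counter;
    uint32_t amount;
  };
  static_assert(std::is_trivial<IncrementCounter>::value &&
                std::is_standard_layout<IncrementCounter>::value,
                "IncrementCounter must be a POD.");

  // Custom dispatch function; it does not have to call into the backend at all.
  inline void DispatchIncrementCounter(const void* data)
  {
    const IncrementCounter* command = static_cast<const IncrementCounter*>(data);
    *command->counter += command->amount;
  }

  const BackendDispatchFunction IncrementCounter::DISPATCH_FUNCTION = &DispatchIncrementCounter;
}
```

Such a command could then be added with AddCommand<userCommands::IncrementCounter>() or chained with AppendCommand<userCommands::IncrementCounter>() like any built-in one, without touching the bucket or the submission code.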

Chaining commands

Now that our command packet also stores a pointer to the next command packet, we can append commands to other commands:

template <typename U, typename V>
U* AppendCommand(V* command, size_t auxMemorySize)
{
  CommandPacket packet = commandPacket::Create<U>(auxMemorySize);

  // append this command to the given one
  commandPacket::StoreNextCommandPacket<V>(command, packet);

  commandPacket::StoreNextCommandPacket(packet, nullptr);
  commandPacket::StoreBackendDispatchFunction(packet, U::DISPATCH_FUNCTION);

  return commandPacket::GetCommand<U>(packet);
}

Note that in this situation we don’t need to store a new key/value pair into our array, because each command that is appended to another one needs to have the same key anyway.

The following example shows how commands can be chained together using the new command bucket API:

for (unsigned int i=0; i < directionalLights.size(); ++i)
{
  PerDirectionalLightConstants constants =
  { directionalLights[i].diffuse, directionalLights[i].specular };

  commands::CopyConstantBufferData* copyOperation = lightingBucket.AddCommand<commands::CopyConstantBufferData>(someKey, sizeof(PerDirectionalLightConstants));

  copyOperation->data = commandPacket::GetAuxiliaryMemory(copyOperation);
  copyOperation->constantBuffer = directionalLightsCB;
  memcpy(copyOperation->data, &constants, sizeof(PerDirectionalLightConstants));
  copyOperation->size = sizeof(PerDirectionalLightConstants);

  commands::Draw* dc = lightingBucket.AppendCommand<commands::Draw>(copyOperation, 0u);
  dc->vertexCount = 3u;
  dc->startVertex = 0u;
}

Revisiting the submission process

Of course, with the command packets in place, the Submit() method also needs to be adapted. By using the backend dispatch, we can get rid of the switch statement, and the linked list of commands can be walked using a simple loop:

void Submit(void)
{
  // ... same as before
  for (unsigned int i=0; i < m_current; ++i)
  {
    // ... same as before
    CommandPacket packet = m_packets[i];
    do
    {
      SubmitPacket(packet);
      packet = commandPacket::LoadNextCommandPacket(packet);
    } while (packet != nullptr);
  }
}

void SubmitPacket(const CommandPacket packet)
{
  const BackendDispatchFunction function = commandPacket::LoadBackendDispatchFunction(packet);
  const void* command = commandPacket::LoadCommand(packet);
  function(command);
}
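The pieces above can be exercised end to end in a small, self-contained sketch. It uses the same packet layout as the post, but everything else is an assumption made for illustration: LogValue is a toy command, the helper names are hypothetical, memcpy replaces union_cast, and packets are freed while walking the chain only because the sketch allocates them with the global operator new:

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <vector>

typedef void* CommandPacket;
typedef void (*BackendDispatchFunction)(const void*);

// Same layout as in the post: [next packet][dispatch function][command]
static const size_t OFFSET_DISPATCH = sizeof(CommandPacket);
static const size_t OFFSET_COMMAND = OFFSET_DISPATCH + sizeof(BackendDispatchFunction);

// Toy command for this sketch: appends its value to a log when dispatched.
struct LogValue
{
  std::vector<int>* log;
  int value;
};

void DispatchLogValue(const void* data)
{
  const LogValue* command = static_cast<const LogValue*>(data);
  command->log->push_back(command->value);
}

// Allocates a packet and fills in the header and the command.
CommandPacket CreateLogPacket(std::vector<int>* log, int value)
{
  char* packet = static_cast<char*>(::operator new(OFFSET_COMMAND + sizeof(LogValue)));
  const CommandPacket next = nullptr;
  const BackendDispatchFunction dispatch = &DispatchLogValue;
  const LogValue command = { log, value };
  std::memcpy(packet, &next, sizeof(next));
  std::memcpy(packet + OFFSET_DISPATCH, &dispatch, sizeof(dispatch));
  std::memcpy(packet + OFFSET_COMMAND, &command, sizeof(command));
  return packet;
}

// Stores 'nextPacket' in the "next" slot at the start of 'packet'.
void Chain(CommandPacket packet, CommandPacket nextPacket)
{
  std::memcpy(packet, &nextPacket, sizeof(nextPacket));
}

// Walks the chain like the loop in Submit(), freeing each packet after dispatch.
void SubmitChain(CommandPacket packet)
{
  while (packet != nullptr)
  {
    char* bytes = static_cast<char*>(packet);
    CommandPacket next;
    BackendDispatchFunction dispatch;
    std::memcpy(&next, bytes, sizeof(next));
    std::memcpy(&dispatch, bytes + OFFSET_DISPATCH, sizeof(dispatch));
    dispatch(bytes + OFFSET_COMMAND);
    ::operator delete(packet);
    packet = next;
  }
}
```

Chaining two packets and submitting the first one dispatches both in the order in which they were chained, which is exactly the guarantee needed for "copy constants, then draw".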

Recap

This post is quite long already, much longer than I anticipated. But still, let us recap which concepts this post introduced:

Commands: a self-contained piece of information that is handed to the backend dispatch. Each command represents one simple operation such as an indexed draw call, copying data to a constant buffer, etc. Each command is implemented as a POD struct.
Backend dispatch: simple forwarding functions that extract data from a command, and forward it to the graphics backend. Each dispatch function deals with a different command.
Command packets: a command packet stores a command, along with additional data such as a pointer to a dispatch function, any auxiliary memory a command might need, and an intrusive linked list for chaining commands.
Chaining of commands: commands that need to be submitted in a certain order can be chained together.
Command bucket: A command bucket stores command packets along with a key of any size.
Multi-threaded rendering: Commands can be added to buckets in parallel from multiple threads. The only two points of synchronization are the memory allocation, and storing the key-value pair into the array of command packets.
Multi-threaded sorting: Each command bucket can be sorted independently, in parallel.

Even though this is already part 3 of the series, there are still things we haven’t talked about in detail yet:

Memory management: How do we allocate the memory for storing the keys and pointers to packets? How do we efficiently allocate memory for individual command packets in the case of multiple threads adding commands to the same bucket? How can we ensure good cache utilization throughout the whole process of storing and submitting command packets? Can we use one contiguous chunk of memory?
Key generation: Which information does a key hold? How do we efficiently build a key?

So for now, our rendering process is stateless and layered/bucketized, but its multi-threaded rendering capabilities can still be greatly improved. Until next time!

37 thoughts on “Stateless, layered, multi-threaded rendering – Part 3: API Design Details”

Alexander Plaksin on January 22, 2015 at 7:50 am said:

Is there some functionality for HW instancing of randomly (space and time) added instances?

- Stefan Reinalter on January 22, 2015 at 3:19 pm said:
  
  Unfortunately I’m not sure if I understood your question correctly. Which functionality are you talking about? HW instancing functionality in general? Functionality in the Molecule Engine? Or functionality regarding this blog post? Can you maybe elaborate a bit more?
  
  - Alexander Plaksin on January 23, 2015 at 1:34 pm said:
    
    My English is very bad. 😦
    How can I draw millions of copies of the same object with the stateless rendering API?
    But there is one restriction: it’s impossible to use a DrawInstanced command (for example), because visibility is checked on the CPU with an octree (for example :))
  - Stefan Reinalter on January 28, 2015 at 1:31 pm said:
    
    There are two options for drawing many objects using the stateless rendering API.
    
    Option 1: Build functionality into the submission backend of the stateless API which is capable of detecting similar draw calls, dissecting the data, and submitting it all in one go with something like DrawInstanced. This is harder to do than Option 2, and cannot be fully done generically because shaders and other parts of the pipeline need to be aware of instanced rendering as well. You cannot just tack instanced rendering onto anything if your shaders are not aware of it.
    
    Option 2: Support DrawInstanced as one of the many draw calls your API will have to support anyway. For instanced geometry “known to the editor” (for lack of a better term) such as vegetation, instanced buildings in a large city, etc. you just issue the DrawInstanced calls yourself. This is of course much easier to pull off than Option 1, but needs to have instancing built into the engine right from the start. This is the option I prefer.
    
    Also, I do not understand why DrawInstanced is impossible to use when visibility checks are done by the CPU? To me, doing culling on the CPU is better for doing instanced rendering than when doing culling on the GPU. For each geometry/entity that is determined as being visible, you would append the data to a command buffer/structured buffer/texture (depends on how you do the instanced rendering), and then submit one DrawInstanced draw call.
Alexander Plaksin on January 23, 2015 at 1:40 pm said:

Does a command bucket have one global render state (blend, depth, etc.) for all contained commands?

- Stefan Reinalter on January 28, 2015 at 1:45 pm said:
  That would certainly be possible, yes.
  
  Advantages:
  - Commands need less memory because you don’t have to carry around render state information.
  - Sorting also becomes easier/faster, because there are more buckets that can be sorted in parallel.
  - Additionally, submission should be faster because render states can only change between the submission of two distinct buckets.
  Disadvantages:
  - It leads to more command buckets being created, potentially a lot more.
  - Might make it harder to fully data-drive render states for operations, and e.g. expose them in the UI in the editor.
  - Nick on September 2, 2017 at 4:07 pm said:
    
    Great series, however I do have some questions about this design.
    
    If this design is stateless, how are you setting the state? (My understanding is that with the DrawCommand you would need to pass all the state with it, right?).
    
    How do you go about sorting your commands? If your commands are discrete chunks (ClearCommand, DrawCommand etc.) you need to ensure that one happens before the other, which you could do with chaining; although you would end up with one huge chain that wouldn’t be possible to sort because it’s all interconnected.
    
    Thanks 🙂
  - Stefan Reinalter on September 7, 2017 at 10:16 am said:
    
    Yes, every draw call carries all the state which is set when the draw call is finally “executed”.
    Sorting is done using integer keys and for cases where command A needs to be done before command B, you can introduce additional bits to carry that information.
    You can also make the clear command data part of your state (e.g. for a certain view) and automatically clear the view before its draw calls are submitted.
    
    For a real-world example with source code, check out bgfx. Well-known library that comes with lots of examples and render backends.
Pingback: Stateless, layered, multi-threaded rendering – Part 4: Memory Management & Synchronization | Molecular Musings
LooseBits on March 11, 2015 at 12:10 pm said:

I have to say that this is an interesting way of designing a stateless API for rendering. Not sure if I missed this one, but how would this kind of design handle rendering resource creation / update? E.g. if you would like to have a procedurally generated texture and you would need to update it constantly? For example in Christer Ericson’s or BGFX’s method you could just push “update texture command” + texture data ptr in queue and call submit / flush.

- Stefan Reinalter on March 13, 2015 at 5:38 pm said:
  
  You would do it quite similarly using this design. There would be another command (similar to Draw and DrawIndexed) which you would add to a bucket, with all its parameters – quite similar to the CopyConstantBufferData command shown in the blog.
  When this command is executed in the backend, the provided data is then copied to the texture.
  
Eyal Kalderon on April 23, 2015 at 6:50 am said:

Been reading your blog for years. Great source of information!

How would you personally implement the renderer backend in an API agnostic way? Would you wrap the target low-level API or APIs in a traditional, stateful “device” and “device context,” in the Direct3D sense? I doubt this would be maintainable if porting to OpenGL, or perhaps Vulkan. Would you take an alternative approach?

- Stefan Reinalter on April 23, 2015 at 7:15 pm said:
  
  I would not wrap the API, but have the renderer backend call directly into the API. The renderer backend is then responsible for replaying the command stream in a way that’s as efficient as possible, making use of all the functions available in D3D, Vulkan, etc.
  For the command stream system I would try to keep everything as low-level as possible (looking at Mantle, Vulkan, Metal) and use that as a baseline for figuring out what things to offer in the command stream. Everything else should be built on top of that. From experience, it is much easier to put higher-level things on top of low-level things, rather than the other way around.
  
  - Eyal Kalderon on April 24, 2015 at 1:39 am said:
    
    I agree completely. But is it possible to cleanly map concepts from lower-overhead APIs (e.g. Vulkan, Metal, Mantle) to more traditional ones like OpenGL and legacy Direct3D? Supporting both modern and traditional APIs through a single, unified backend interface without any kind of wrapping would be difficult.
  - Stefan Reinalter on April 30, 2015 at 12:39 am said:
    
    Yes, I think it is possible, but the commands the user puts into the command queue then possibly become something entirely different. E.g. maybe we no longer deal with SetState, DrawIndexed, etc. on a command-based level. Of course, knowing beforehand what the APIs are going to look like helps.
Alexander Plaksin on April 26, 2015 at 9:35 am said:

Hello!
Do you have a command, something like SetMaterial, that sets shaders, textures, and so on? If this function is called many times with the same material, how are the commands then sorted?
For example, we have many meshes with same material in a scene. Render scene like this:
for_each(mesh in meshes)
{
command_bucket.add_command(new SetMaterialCommand(mesh.material));
command_bucket.add_command(new DrawCommand(mesh.geometry));
}
I do not understand a bit.
Thanks.

- Stefan Reinalter on April 30, 2015 at 12:51 am said:
  
  No, I would not do it with a command that sets textures, states, etc. I would rather directly add this data to the draw call (the structs introduced in the post), or put e.g. the material ID in the key which is used for sorting.
  
  Remember that we want to keep the rendering part (submitting draw calls) stateless, so that setting a state somewhere doesn’t affect all draw calls surrounding it.
  
SirTimothy on April 29, 2015 at 10:25 pm said:

Hey, great series of articles! I’ve been working on a stateless gpu layer of my own. It has a slightly different form, but these articles served as motivation and inspiration.

I was wondering how you deal with updating resource data, such as dynamic vertex buffers, constant buffers, etc. In one of your examples, you show a CopyConstantBufferData command, which appears to copy from some memory buffer to the constant buffer. But this could result in extra copying (if somewhere in the renderer you collect data from various sources, put those into a temporary chunk of memory, and then when executing the command list/bucket/packet, it gets copied from the temp memory into the buffer resource itself). It would be nice instead to be able to build the data directly into the locked buffer.

One idea I’ve come up with is to provide a callback in the command data, so that when the BuildConstantBufferData command executes, it could lock the buffer, call the callback function to build the data in-place, and then unlock. This may work for some cases, but doesn’t seem like the nicest thing. Plus, if I want to build the constant buffer _while_ building the command list, I can’t really do that. I may end up having to traverse my scene an extra time in the callback (which also means a potentially significant amount of time when I’m not submitting draw calls).

If I could be sure that a given resource only needs to be updated once, I could just do it outside of any command lists. So maybe the answer is to just limit myself to updating any resource once, and use some pool of extra buffers that I can cycle through as needed. Or just settle for possibly copying data an extra time.

Thanks!

- Stefan Reinalter on April 30, 2015 at 1:18 am said:
  I touched upon this subject in my first answer to a comment in the first post of this series, although in a different context, regarding CPU & GPU synchronization.
  
  I would say you basically have two options:
  1. Be on the safe side, and do the temporary copy as illustrated in the post. This means that dynamic data is temporarily copied to a memory location held by the command bucket, only to be used for updating some kind of GPU resource before the draw call is submitted. As said, this needs more memory and more copy operations. On the positive side, it is safe to use, because the programmer does not need to think about the lifetime of his data once it has been submitted into the bucket.
  2. Offer a second, alternative version of e.g. CopyConstantBufferData, which does not copy the data, but only stores the pointer to the data. The data is then copied from there when updating the GPU resource. This saves memory and copy operations, but puts more burden on the programmer, because he now has to make sure that the memory location pointed to is still valid when the data is copied in the backend.
  I would not worry about the fact that the CPU first prepares the data and stores it somewhere on the heap, and then the backend copies it into the GPU resource from there. If you really want to support in-place updating of dynamic resources, I would probably add a generic draw call that allows the user to submit a function pointer (or similar) which gets called whenever this command is executed. Then it is up to the user to decide what to do inside that function.
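
  The two options and the generic callback command could be sketched roughly as follows. This is only an illustration under simplifying assumptions: `ConstantBuffer` stands in for a real GPU resource, and `CopyingBufferUpdate`, `ReferencingBufferUpdate`, `CallbackCommand`, `RecordCopy` and `Execute` are hypothetical names, not Molecule's actual commands.

  ```cpp
  #include <cstddef>
  #include <vector>

  // Hypothetical stand-in for a GPU constant buffer; a real backend would lock/map GPU memory.
  struct ConstantBuffer
  {
      std::vector<unsigned char> storage;
  };

  // Option 1: the command owns a temporary copy, so the caller's data may go away after recording.
  struct CopyingBufferUpdate
  {
      ConstantBuffer* target;
      std::vector<unsigned char> copy;
  };

  // Option 2: the command only stores a pointer; the caller must keep the data alive until execution.
  struct ReferencingBufferUpdate
  {
      ConstantBuffer* target;
      const void* data;
      std::size_t size;
  };

  // Generic command: the user decides what happens when the command is executed.
  struct CallbackCommand
  {
      void (*function)(void* userData);
      void* userData;
  };

  // Records option 1 by copying 'size' bytes at submission time.
  inline CopyingBufferUpdate RecordCopy(ConstantBuffer* target, const void* data, std::size_t size)
  {
      const unsigned char* bytes = static_cast<const unsigned char*>(data);
      return CopyingBufferUpdate{ target, std::vector<unsigned char>(bytes, bytes + size) };
  }

  inline void Execute(const CopyingBufferUpdate& command)
  {
      command.target->storage = command.copy;
  }

  inline void Execute(const ReferencingBufferUpdate& command)
  {
      // Reads the caller's memory now, so it sees whatever the data looks like at execution time.
      const unsigned char* bytes = static_cast<const unsigned char*>(command.data);
      command.target->storage.assign(bytes, bytes + command.size);
  }

  inline void Execute(const CallbackCommand& command)
  {
      command.function(command.userData);
  }
  ```

  The difference shows up when the source data changes between recording and execution: the copying variant uploads the value it saw at record time, while the referencing variant uploads whatever is in memory when the backend runs.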
Eyal Kalderon on April 30, 2015 at 6:12 am said:
Hi again, Stefan! I have two more questions for you.

I understand how the sort keys are encoded for DrawCommands, but what about CopyConstantBufferData and the like? How do you ensure they are executed before the appropriate DrawCommands? Chain them?

Secondly, would you implement ClearColorCommand, ClearDepthStencil, etc. as commands?
- Stefan Reinalter on May 2, 2015 at 1:47 am said:
  
  Yes, commands that need to go together are chained together so that they use the same key for sorting.
  And yes, things such as clearing render targets also would need to get their own commands, even though they “break” the stateless rendering system, strictly speaking. As a start, I would therefore add clear parameters to e.g. the layers that define into which render target a bucket should render. So every time a layer gets rendered, it uses the parameters to clear the corresponding render target. If you find that you still need to clear render targets while rendering into them, only then would I add those commands.
  
sebastiend on July 10, 2015 at 3:51 pm said:

Hi,

Thanks for these blog posts.

However, there is one thing I don’t understand. Your bucket’s Submit function simply iterates over the command array. This assumes that the commands are already in the correct order. But:
“Note that the keys and their data are stored in two separate arrays. The reason for that is that certain sorting algorithms don’t swap during the sort operation but rather populate a separate array with indices to the sorted entries, so they have to touch less data.”

I guess the command data is sorted in the end. But doesn’t the swapping process lead to cache misses after all?

Sincerely

- Stefan Reinalter on July 15, 2015 at 11:48 am said:
  
  You’re right, the posted Submit function assumes that the keys and data are already sorted, to keep things simpler.
  What I meant with my comment regarding sorting was the following:
  
  If you use something like std::sort or similar, you need to provide both keys and the associated data. This means that every time two items in the array need to swap places, both the key and the data will be swapped.
  On the other hand, if you use a radix sort (which I would recommend), no swapping of the original data takes place, but you get an array with the indices of the sorted key/data pairs.
  
  So in the first case, you end up touching all keys and the associated data during the sort process, and then linearly walk through the memory again when submitting.
  In the second case, you only touch the key data during sorting, and then walk the array of indices linearly, doing an indirect lookup to the keys and data. Additionally, radix-sorting 16-bit, 32-bit or 64-bit data is vastly faster than e.g. quicksort or std::sort.
  
  In some of my experiments, using a radix sort and separate key/value streams was roughly 10 times faster than std::sort, but your mileage may vary of course.
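
  To make the second case concrete, here is a minimal sketch of a radix sort that leaves the keys untouched and only produces an array of sorted indices. It assumes 32-bit keys and 8-bit digits; `RadixSortIndices` is a hypothetical name, and this is not the implementation used in Molecule.

  ```cpp
  #include <array>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Returns indices into 'keys' in ascending key order; 'keys' itself is never modified.
  inline std::vector<std::uint32_t> RadixSortIndices(const std::vector<std::uint32_t>& keys)
  {
      const std::size_t count = keys.size();
      std::vector<std::uint32_t> indices(count);
      std::vector<std::uint32_t> scratch(count);
      for (std::size_t i = 0; i < count; ++i)
          indices[i] = static_cast<std::uint32_t>(i);

      // Four stable passes of 8 bits each, least significant byte first.
      for (unsigned int shift = 0; shift < 32; shift += 8)
      {
          // offsets[d + 1] first counts how often digit d occurs.
          std::array<std::size_t, 257> offsets{};
          for (std::size_t i = 0; i < count; ++i)
              ++offsets[((keys[i] >> shift) & 0xFFu) + 1];

          // Prefix sum: offsets[d] becomes the first output slot for digit d.
          for (std::size_t digit = 0; digit < 256; ++digit)
              offsets[digit + 1] += offsets[digit];

          // Scatter the indices (not the keys or the data) into their slots.
          for (std::size_t i = 0; i < count; ++i)
          {
              const std::uint32_t index = indices[i];
              scratch[offsets[(keys[index] >> shift) & 0xFFu]++] = index;
          }
          indices.swap(scratch);
      }
      return indices;
  }
  ```

  Submitting then means walking the returned indices linearly and looking up `keys[index]` and the matching data pointer for each entry, which is the indirect lookup mentioned above.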
  
Julien K on September 10, 2015 at 9:03 pm said:

Hi,
I am about to implement a “stateless rendering api” myself.
However, I am not sure to what level I should abstract the renderer.

For instance, should the renderQueue also be responsible for uploading data such as textures?
And how do you handle position data (known as worldMatrix)?

- Stefan Reinalter on September 10, 2015 at 10:13 pm said:
  
  The render queue must also have commands for uploading/changing texture data, yes. Same for transformations – you will need commands for setting world, view and projection transformations.
  
  How that is handled internally (size & frequency of constant buffers) is a different story.
  
Ivan on October 26, 2015 at 9:40 am said:

Hi,

What is VertexLayoutHandle in your implementation? As I understand it, it contains vertex attributes such as their size, type (float, int), and semantics (Position, Color, etc.). From previous posts, data is stored in some manager that returns only a handle. Suppose you have 4096 index/vertex buffers for static meshes; must there then be 4096 VertexLayout instances?

Another question is how you manage other parameters like color and UV coordinates. Are they in static buffers too, or is there another allocation strategy?

Thanks for your articles – it’s great food for the brain!

- Stefan Reinalter on October 27, 2015 at 11:16 am said:
  
  What is VertexLayoutHandle in your implementation?
  
  It is a handle to a vertex layout, which contains information about the structure of the vertex data being bound. Vertex layouts can be shared between meshes, and you don’t really need that many of them. Certainly not 4096, that would be way too much.
  
  Another question is how you manage other parameters like color and UV coordinates. Are they in static buffers too, or is there another allocation strategy?
  
  They belong to a model/mesh, and are therefore also stored in the static buffers holding the mesh data.
  
Rajiv on January 26, 2018 at 10:42 am said:

Why are you manually creating a command packet structure in memory instead of using a struct? Is it because of the padding added by the compiler, and so that you can access the actual Cmd through a known offset? If so, why not use a struct and offsetof(Cmd) instead?

- Stefan Reinalter on January 26, 2018 at 12:07 pm said:
  
  Because the auxiliary memory needed by the command is optional, but I want to store the data itself right next to the command packet, i.e. contiguous.
  C99’s Flexible Array Members would help, but they’re not supported by all C++ compilers.
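
  As a simplified sketch of that idea, a command and its optional auxiliary memory can be placed in one allocation, with the auxiliary bytes starting directly behind the command. The names `CopyDataCommand`, `CreateCommand`, `GetAuxiliaryMemory` and `DestroyCommand` are hypothetical, and the sketch leaves out the other parts of a command packet (such as a pointer to the next packet or a dispatch function).

  ```cpp
  #include <cstddef>
  #include <cstdlib>
  #include <new>

  // Hypothetical command; a real one would carry resource handles, counts, etc.
  struct CopyDataCommand
  {
      std::size_t size;
  };

  // Allocates one contiguous block laid out as [ Command | auxSize auxiliary bytes ].
  // Passing auxSize == 0 yields a command without auxiliary memory.
  template <typename Command>
  Command* CreateCommand(std::size_t auxSize)
  {
      void* memory = std::malloc(sizeof(Command) + auxSize);
      if (memory == nullptr)
          return nullptr;
      return new (memory) Command();
  }

  // The auxiliary memory starts directly behind the command.
  template <typename Command>
  char* GetAuxiliaryMemory(Command* command)
  {
      return reinterpret_cast<char*>(command) + sizeof(Command);
  }

  // Releases the command together with its auxiliary memory.
  template <typename Command>
  void DestroyCommand(Command* command)
  {
      command->~Command();
      std::free(command);
  }
  ```

  Because the size of the auxiliary region is only known at runtime, it cannot be expressed as a fixed-size struct member, which is why the offset is computed by hand here.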
  
  - Rajiv on March 9, 2018 at 5:35 am said:
    
    Hi Stefan
    I have one more question.
    
    How do you handle updating uniform variables (GLSL)? Do you create commands like UpdateUniform1F and push them into the bucket, chained together with a Draw or DrawIndexed command, or do you create special commands for rendering each effect or object, like ScreenFadeOut and CharacterCloth?
    
    What is the better approach?
  - Stefan Reinalter on March 16, 2018 at 2:58 pm said:
    
    Neither of the two is better per se, I think.
    
    The more general approach would be the first you described. However, if you find yourself in a situation where this leads to a lot of UpdateUniforms and draw commands being generated, it might be better to build specialized commands like the ones you mentioned.
- Rajiv on March 16, 2018 at 3:39 pm said:
  
  Thanks 🙂
  
jacksparow on October 18, 2018 at 9:48 am said:

Normally, before we call the draw(…) function, we need to call “IASetVertexBuffers” to bind the vertex buffer. Where is the right place to do this? Thanks!

Betting Raja on December 20, 2018 at 7:28 am said:

If you have two arrays for key and data, how do you make sure the data array order matches the key array order after sorting?
Before sorting:
Keys[] = { 4, 5, 1, 2 }
Data[] = { Gold, Silver, Copper, Glass }

After sorting:
Keys[] = { 1, 2, 4, 5 }
Data[] = { Gold, Silver, Copper, Glass }

How do we make sure Data[] is also sorted?

- Stefan Reinalter on December 21, 2018 at 12:46 pm said:
  
  You don’t :).
  Store a pointer to the data inside the key instead. Storing an index would also work and most likely be smaller than a pointer, e.g. in some scenarios a 16-bit index would probably be enough.
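
  Applied to the example from the question, this could look like the sketch below. It assumes a hypothetical 32-bit key whose upper 16 bits hold the sort value and whose lower 16 bits hold the index into the data array; `MakeKey`, `DataIndexOf` and `InSortedOrder` are made-up names, and std::sort is used only to keep the example short.

  ```cpp
  #include <algorithm>
  #include <cstdint>
  #include <string>
  #include <vector>

  // Packs the sort value into the upper 16 bits and the data index into the lower 16 bits.
  inline std::uint32_t MakeKey(std::uint16_t sortValue, std::uint16_t dataIndex)
  {
      return (static_cast<std::uint32_t>(sortValue) << 16) | dataIndex;
  }

  // Extracts the data index stored in the lower 16 bits of a key.
  inline std::uint16_t DataIndexOf(std::uint32_t key)
  {
      return static_cast<std::uint16_t>(key & 0xFFFFu);
  }

  // Sorts only the keys; the data array stays where it is and is read through the stored index.
  inline std::vector<std::string> InSortedOrder(std::vector<std::uint32_t> keys,
                                                const std::vector<std::string>& data)
  {
      std::sort(keys.begin(), keys.end());
      std::vector<std::string> result;
      result.reserve(keys.size());
      for (std::uint32_t key : keys)
          result.push_back(data[DataIndexOf(key)]);
      return result;
  }
  ```

  With keys 4, 5, 1, 2 paired with data indices 0 to 3, the data is visited in the order Copper, Glass, Gold, Silver, even though the Data[] array itself was never reordered.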
  
Shout on February 1, 2020 at 7:22 pm said:

How is GenerateKey() implemented?

- Stefan Reinalter on March 19, 2020 at 2:59 pm said:
  
  There is no one-size-fits-all implementation.
  Basically, you shift and OR together the important bits of your commands. Depending on the sort criteria, the bits will have to end up either closer to the LSB or MSB, and where you place those bits is up to you to decide.
  
  See this post for an example image: http://realtimecollisiondetection.net/blog/?p=86
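
  As one possible illustration of shifting and ORing bits together, the sketch below builds a 64-bit key from a purely hypothetical layout (8 bits layer, 1 bit translucency, 24 bits depth, 31 bits material). The field names, the bit widths and the function name `GenerateKeyExample` are assumptions made for this example only.

  ```cpp
  #include <cstdint>

  // Hypothetical key layout; the most significant bits dominate the sort order:
  // [ layer: 8 bits | translucent: 1 bit | depth: 24 bits | material: 31 bits ]
  struct KeyFields
  {
      std::uint32_t layer;
      std::uint32_t translucent;
      std::uint32_t depth;
      std::uint32_t material;
  };

  inline std::uint64_t GenerateKeyExample(const KeyFields& fields)
  {
      // Each field is masked to its width, widened to 64 bits, and shifted into place.
      return (static_cast<std::uint64_t>(fields.layer       & 0xFFu)       << 56)
           | (static_cast<std::uint64_t>(fields.translucent & 0x1u)        << 55)
           | (static_cast<std::uint64_t>(fields.depth       & 0xFFFFFFu)   << 31)
           |  static_cast<std::uint64_t>(fields.material    & 0x7FFFFFFFu);
  }
  ```

  With this layout, any draw call in a lower layer sorts before every draw call in a higher layer, and within a layer depth takes precedence over material. Swapping the positions of the depth and material bits would instead group draw calls by material first.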
  