Continuing where we left off last time, today I want to present a few ideas about how to design the API that enables us to do stateless rendering.

Before we talk about how to design a stateless rendering API, let us quickly take a look at how a stateful rendering API usually performs its task.

Conventional, stateful rendering

This is the type of rendering everybody knows: you set a few states here and there, submit a draw call, set some states, submit a draw call, and so on.

Usually, this looks something like the following:

// 1) render first object
backend::SetCullState(CULLSTATE_BACK);
backend::SetVertexBuffer(vb);
backend::SetIndexBuffer(ib);
backend::BindTexture(0u, diffuse);
backend::DrawIndexed(triCount*3, 0u, 0u);

// 2) render second object
backend::SetCullState(CULLSTATE_FRONT);
backend::BindTexture(0u, otherDiffuse);
backend::SetAlphaBlendState(ONE_ONE);
backend::DrawIndexed(triCount*3, 0u, 0u);

The problem with that abstraction is that whatever state was set when rendering the first object also affects the rendering of the second object, which affects the rendering of the third object, and so on. State once set in the pipeline leaks into subsequent draw calls, and in case it isn’t obvious, there are actually two problems (not just one!) with the stateful abstraction above:

First problem: When rendering e.g. a third object, it will be rendered with a reversed culling state in case we forget to set it back to CULLSTATE_BACK. Same with alpha blending. This is the smaller of the two problems.
Second problem: Whenever you have to change a state of some draw call, all following draw calls not touching the same state will now be broken. This is much worse than the first problem, because you either have to change all the draw calls you actually didn’t want to touch, or always reset every state you touched to its default value after a draw call has been submitted. This is both error-prone and tedious.

And we haven’t even started talking about multi-threaded rendering yet.

To elaborate a bit on the second point, imagine what would happen if we changed the code above to the following:

// 1) render first object
backend::SetCullState(CULLSTATE_BACK);
backend::SetVertexBuffer(vb);
backend::SetIndexBuffer(ib);
backend::SetRasterizerState(NO_DEPTH_WRITE); // <===
backend::BindTexture(0u, diffuse);
backend::DrawIndexed(triCount*3, 0u, 0u);

// 2) render second object
backend::SetCullState(CULLSTATE_FRONT);
backend::BindTexture(0u, otherDiffuse);
backend::SetAlphaBlendState(ONE_ONE);
backend::DrawIndexed(triCount*3, 0u, 0u);

By introducing a new command SetRasterizerState that changes the state of the pipeline, all draw calls following the first one are also affected by our change, because the other draw calls never touch that state. We either have to set it explicitly in the second draw call, or reset it after submitting the first DrawIndexed. It’s much worse when you want to move certain render operations from here to there, put them in a function, etc. because you always have to be aware of the “surrounding state”. Like I said, error-prone and tedious.
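To make the leak concrete, here is a minimal mock of a stateful backend. It is only a sketch: `MockBackend`, `PipelineState` and the two enums are hypothetical stand-ins, not the real backend API. Each draw call records whatever state happens to be current at that moment.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real render states; names are illustrative only.
enum class CullState { Back, Front };
enum class DepthWrite { Enabled, Disabled };

struct PipelineState
{
    CullState cull = CullState::Back;
    DepthWrite depthWrite = DepthWrite::Enabled;
};

// A mock stateful backend: every draw call records a snapshot of the current state.
struct MockBackend
{
    PipelineState current;
    std::vector<PipelineState> drawn;

    void SetCullState(CullState s) { current.cull = s; }
    void SetDepthWrite(DepthWrite s) { current.depthWrite = s; }
    void DrawIndexed() { drawn.push_back(current); }
};

// Renders two objects; only the first one asks for depth writes to be disabled.
inline std::vector<PipelineState> RenderTwoObjects()
{
    MockBackend backend;

    // 1) first object
    backend.SetCullState(CullState::Back);
    backend.SetDepthWrite(DepthWrite::Disabled);
    backend.DrawIndexed();

    // 2) second object never touches depth writes, so it inherits the first object's setting
    backend.SetCullState(CullState::Front);
    backend.DrawIndexed();

    return backend.drawn;
}
```

The second snapshot ends up with depth writes disabled even though the second object never asked for that, which is exactly the “surrounding state” problem described above.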

Introducing a stateless API

Armed with the knowledge of what’s clearly wrong with the stateful approach above, let us try to come up with better solutions. One possible solution would be to start from a clean default state each frame, and reset all the states back to their default whenever we submit a draw call. If the user were to do that himself, this could look like the following:

// at the beginning of a frame, all states are set to their default value

// 1) render first object
backend::SetCullState(CULLSTATE_BACK);
backend::SetVertexBuffer(vb);
backend::SetIndexBuffer(ib);
backend::BindTexture(0u, diffuse);
backend::DrawIndexed(triCount*3, 0u, 0u);
backend::ResetDefault(); // <===

// 2) render second object
backend::SetCullState(CULLSTATE_FRONT);
backend::BindTexture(0u, otherDiffuse);
backend::SetAlphaBlendState(ONE_ONE);
backend::DrawIndexed(triCount*3, 0u, 0u);
backend::ResetDefault(); // <===

Of course, we could also put that functionality into our API, and let it take care of that.

For now, let us assume that we have one big render queue which is used for queueing up all draw calls during a frame, which then get sorted and dispatched using the render backend at the end of a frame. Then we could do the following:

// 1) render first object
renderQueue::SetCullState(CULLSTATE_BACK);
renderQueue::SetVertexBuffer(vb);
renderQueue::SetIndexBuffer(ib);
renderQueue::BindTexture(0u, diffuse);
renderQueue::SubmitIndexed(triCount*3, 0u, 0u);

// 2) render second object
renderQueue::SetCullState(CULLSTATE_FRONT);
renderQueue::BindTexture(0u, otherDiffuse);
renderQueue::SetAlphaBlendState(ONE_ONE);
renderQueue::SubmitIndexed(triCount*3, 0u, 0u);

// at the end of a frame:
renderQueue::Sort();
renderQueue::Flush();

Basically, all our renderQueue implementation has to do is the following:

Keep track of the currently set vertex buffer, index buffer, cull state, alpha state, texture samplers, etc. Whenever someone calls renderQueue::Set*State(), simply change the corresponding member to the new state.
For each Submit*() call, insert a new draw call into the queue. Our queue in this case would be raw memory, and we would simply store the type of the operation (an indexed draw call), the key (used for sorting), and all data that goes along with the draw call (in our case all the current states). After that, we reset all our internal state members to their default value.
Upon a call to Sort(), we simply sort all the keys using e.g. a radix sort.
Upon a call to Flush(), we walk the sorted array of operations, fetch the type, fetch the data, and call the respective render backend functions. It’s very similar to implementing a simple virtual machine.
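The four steps above could be sketched roughly as follows. This is a single-threaded illustration under several assumptions: `DrawState` is a hypothetical, heavily reduced state block, the queue uses `std::vector` instead of raw memory, `std::stable_sort` stands in for the radix sort, and the key is passed explicitly to `SubmitIndexed()` because key generation has not been discussed yet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical state block; a real one would hold buffers, textures, blend state, etc.
struct DrawState
{
    int cullState = 0;              // 0 = default
    int alphaBlend = 0;             // 0 = default (blending off)
    std::uint32_t indexCount = 0;
};

class RenderQueue
{
public:
    // Step 1: setters only change the tracked "current" state.
    void SetCullState(int s) { m_current.cullState = s; }
    void SetAlphaBlendState(int s) { m_current.alphaBlend = s; }

    // Step 2: store the key plus a copy of the current state, then reset to defaults.
    void SubmitIndexed(std::uint64_t key, std::uint32_t indexCount)
    {
        m_current.indexCount = indexCount;
        m_entries.push_back(Entry{ key, static_cast<std::uint32_t>(m_data.size()) });
        m_data.push_back(m_current);
        m_current = DrawState{};
    }

    // Step 3: sort only the small (key, index) pairs; the state data itself never moves.
    void Sort()
    {
        std::stable_sort(m_entries.begin(), m_entries.end(),
            [](const Entry& a, const Entry& b) { return a.key < b.key; });
    }

    // Step 4: walk the sorted entries and hand each draw call's state to the backend.
    template <typename Backend>
    void Flush(Backend&& dispatch)
    {
        for (const Entry& e : m_entries)
        {
            assert(e.dataIndex < m_data.size());
            dispatch(m_data[e.dataIndex]);
        }
        m_entries.clear();
        m_data.clear();
    }

private:
    struct Entry { std::uint64_t key; std::uint32_t dataIndex; };

    DrawState m_current;
    std::vector<Entry> m_entries;
    std::vector<DrawState> m_data;
};
```

Because `SubmitIndexed()` resets the tracked state, a state set for one draw call can no longer leak into the next one.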

Of course, there are many implementation details we haven’t talked about yet, but that’s basically the gist of it. However, there is one thing I really don’t like about that approach, as soon as multi-threaded rendering enters the picture.

With multi-threaded rendering, we want to be able to call any renderQueue function from any thread, which means that even though the C++ code looks like sequential code, calls to various renderQueue::Set*() functions are made from different threads, and are therefore interleaved. We can no longer use simple members in our renderQueue implementation to keep track of the current state, and not even wrapping each function with a mutex (or similar) would work, because we would need to wrap all operations that belong to a single draw call at once. This has way too much overhead; don’t even think about doing it this way.

There is, of course, a simpler and faster solution to that: thread-local storage. Instead of keeping track of the currently set state using simple members in the renderQueue, each thread keeps track of its state using e.g. a thread-local struct which holds all the states.
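A rough sketch of the thread-local variant, assuming C++11 `thread_local` is available. `DrawState` and the variable names are hypothetical, and the shared queue is guarded by a plain mutex purely to keep the example short; the lock only covers the final copy into the queue, not the individual state changes.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical, heavily reduced state block.
struct DrawState
{
    int cullState = 0;
    std::uint32_t indexCount = 0;
};

namespace renderQueue
{
    // Each thread owns its own "current state", so setters need no synchronization.
    thread_local DrawState t_current;

    std::mutex g_mutex;                 // protects only the shared queue below
    std::vector<DrawState> g_queue;

    inline void SetCullState(int cullState) { t_current.cullState = cullState; }

    inline void SubmitIndexed(std::uint32_t indexCount)
    {
        assert(indexCount > 0);         // a draw call without indices is most likely a bug
        t_current.indexCount = indexCount;
        {
            std::lock_guard<std::mutex> lock(g_mutex);
            g_queue.push_back(t_current);
        }
        t_current = DrawState{};        // back to defaults for this thread's next draw call
    }
}
```

Calls to `SetCullState()` made on different threads modify different `t_current` instances, so interleaving them is harmless.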

However, I’m still not satisfied with such an approach, because it means that every renderQueue function call now has to access some thread-local variable, which adds overhead compared to just accessing memory. Therefore, I am also considering the following alternatives.

Alternative 1

The first one boils down to creating structs holding the state for a draw call on the stack, and then copying all of it into the queue upon submitting a draw call, something like the following:

 IndexedDrawCall dc;
 dc.SetVertexBuffer(mesh->vertexBuffer);
 dc.SetIndexBuffer(mesh->indexBuffer);
 dc.SetCullState(CULLSTATE_BACK);
 renderQueue::Submit(dc);

Firstly, this pretty much gets rid of all the multi-threading problems we have seen in the approach above. If we want to use one global queue, all we have to do is copy the data given in a call to renderQueue::Submit() (along with the key for sorting). For that, we can simply use a linear allocator that does nothing more than increment a pointer for each allocation. By using atomic operations, we can trivially make the allocation both thread-safe, and fast. If we don’t want to use atomic operations, we can use a thread-local queue per thread instead.

Secondly, this would allow us to cache certain draw calls. For certain static parts of the world, we could build the draw call once, store it somewhere, and submit it into the renderQueue without any additional work.

Thirdly, each draw call like IndexedDrawCall, InstancedDrawCall, ComputeDrawCall, etc. could make sure to only store the data it needs, which could cut down on the amount of memory required to store the individual draw calls.
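The linear allocator mentioned in the first point could look something like the sketch below. The class name, the fixed 16-byte alignment and the behaviour on exhaustion (returning a null pointer) are assumptions made for illustration, not a description of an existing implementation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Every allocation is rounded up to this size so that consecutive allocations stay
// aligned, provided the underlying buffer itself is aligned to it.
constexpr std::size_t kAlignment = 16;

// A fixed-size linear allocator: allocating is a single atomic add on an offset.
// Memory is never freed individually; Reset() rewinds everything, e.g. once per frame.
class LinearAllocator
{
public:
    LinearAllocator(void* memory, std::size_t size)
        : m_begin(static_cast<std::uint8_t*>(memory)), m_size(size), m_offset(0)
    {
        assert(memory != nullptr);
    }

    // Thread-safe. Returns nullptr once the buffer is exhausted.
    void* Allocate(std::size_t size)
    {
        const std::size_t rounded = (size + kAlignment - 1) & ~(kAlignment - 1);
        const std::size_t offset = m_offset.fetch_add(rounded, std::memory_order_relaxed);
        if (offset + rounded > m_size)
            return nullptr;
        return m_begin + offset;
    }

    // Must not be called while other threads are still allocating.
    void Reset() { m_offset.store(0, std::memory_order_relaxed); }

private:
    std::uint8_t* m_begin;
    std::size_t m_size;
    std::atomic<std::size_t> m_offset;
};
```

A `renderQueue::Submit()` built on top of this would allocate `sizeof(IndexedDrawCall)` bytes and copy the struct into the returned memory. Note that in this simplified version failed allocations still advance the offset, which is harmless as long as `Reset()` is called every frame.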

There are two things which I don’t like with this approach, though:

Each instance of a draw call struct is stateful again, which means that the user could create a draw call on the stack, submit it once, change its state, and submit it again. Of course that’s up to the user and not recommended, but in that regard we are back to square one, so to say.
We are accessing memory much more often than we need to, because we first change the state of the struct on the stack, and then copy all its data to some other place in memory depending on where renderQueue::Submit() copies the data to.

Which brings me to my last and currently preferred alternative:

Alternative 2

Instead of creating draw call structs on the stack, you have to ask the renderQueue to hand one to you:

 IndexedDrawCall* dc = renderQueue::CreateIndexedDrawCall();
 dc->SetVertexBuffer(mesh->vertexBuffer);
 dc->SetIndexBuffer(mesh->indexBuffer);
 dc->SetCullState(CULLSTATE_BACK);
 renderQueue::Submit(dc);

It doesn’t look like much of a difference, but there are a few things we can do here:

When creating a new draw call (e.g. using CreateIndexedDrawCall()), we again have the option of using a global queue and atomic operations for allocating memory, or use thread-local queues. I would prefer the latter (more on that in the next post), but the point is that “creating” such a draw call essentially just increments a pointer internally, handing the user the final destination of all the draw call’s data. This means that we no longer manipulate a struct on the stack and copy it afterwards, but directly write into memory. A call to Submit() then only has to store the key, and a pointer to where the data is stored.
Because we are in control of how draw calls are created, we can easily make sure that the user cannot submit a draw call twice. We could do that by e.g. checking the pointer given as an argument to renderQueue::Submit(): if its address is less than or equal to that of the last submitted draw call, the user tried to submit the same draw call twice, which is invalid because it implies stateful usage of a draw call struct. Note that this check assumes draw calls are submitted in the order in which they were created.
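Both points can be sketched together. The following is a single-threaded illustration (think of it as one thread-local queue); `DrawCallQueue`, the integer handles and the boolean return value of `Submit()` are hypothetical choices, and the key is again passed in explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct IndexedDrawCall
{
    std::uint32_t vertexBuffer = 0;   // hypothetical resource handles
    std::uint32_t indexBuffer = 0;
    int cullState = 0;
};

class DrawCallQueue
{
public:
    explicit DrawCallQueue(std::size_t capacity)
        : m_storage(capacity), m_used(0), m_lastSubmitted(nullptr) {}

    // "Creating" a draw call bumps an index and returns the final storage location,
    // so the user writes straight into queue memory. Returns nullptr when full.
    IndexedDrawCall* CreateIndexedDrawCall()
    {
        if (m_used == m_storage.size())
            return nullptr;
        m_storage[m_used] = IndexedDrawCall{};
        return &m_storage[m_used++];
    }

    // Stores only the key and the pointer. Returns false if the address is not greater
    // than that of the last submitted draw call, i.e. on a (suspected) resubmission.
    bool Submit(std::uint64_t key, const IndexedDrawCall* dc)
    {
        assert(dc != nullptr);
        if (m_lastSubmitted != nullptr && dc <= m_lastSubmitted)
            return false;
        m_lastSubmitted = dc;
        m_entries.push_back(Entry{ key, dc });
        return true;
    }

    std::size_t SubmittedCount() const { return m_entries.size(); }

private:
    struct Entry { std::uint64_t key; const IndexedDrawCall* data; };

    std::vector<IndexedDrawCall> m_storage;   // preallocated and never resized
    std::size_t m_used;
    const IndexedDrawCall* m_lastSubmitted;
    std::vector<Entry> m_entries;
};
```

The address check is cheap, but it also rejects draw calls that are submitted out of creation order, so a real implementation would have to decide whether that restriction is acceptable.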

Conclusion

As can be seen, there are a few alternatives to how we can implement a stateless API. I think it is important to keep in mind things like multi-threaded rendering and how memory allocations for draw call data are handled when designing such an API.

Note that I only briefly touched the subject of multi-threaded rendering. There are many more things to consider like false sharing, how allocations are made, and when and how data is written to memory. I think about those things when designing such an API, but didn’t have the time (yet) to write down all my thoughts and ideas – the post is already quite long as it is.

Further note that we also haven’t talked about how to generate keys for sorting the data yet, and how we try to “group” draw calls by individual layers, introduced in the first post. This will be the topic of the next post!

Disclaimer

I have not implemented any of the above yet, so please take this with a grain of salt. It is surely not a final design, because these things usually take a few iterations until you come up with something that you are truly satisfied with.

Let me know about any oversights or faults I made, and feel free to discuss other, better alternatives I might have missed in the comments!

13 thoughts on “Stateless, layered, multi-threaded rendering – Part 2: Stateless API Design”

Andrei Radu on November 14, 2014 at 8:03 am said:

Another advantage of your last approach is that it maps easily to something like Metal, with a render queue being a thin wrapper over a RenderCommandEncoder (or a ParallelRenderCommandEncoder). You would have to do the sorting somewhere else (or just don’t sort).

- Stefan Reinalter on November 14, 2014 at 10:27 am said:
  
  Hi Andrei,
  
  Thanks for the info. I haven’t done any work with Metal yet, so it’s good to know!
  
knarkowicz on November 15, 2014 at 7:17 pm said:

Hi,

We have been using a similar design for more than 6 years. It’s very simple and works great. For us it started as a method for sorting draw calls, so we only have a struct type, but now we also use it for MT rendering. The only downside is that it gets complicated when you try to optimize by caching static draw calls.

rotoglup on November 16, 2014 at 4:16 pm said:

Hi,

Your ‘alternative 2’ does not seem to prevent the problem you quote for ‘alternative 1’: drawcall instances can still be ‘stateful’, as nothing prevents the user from keeping a pointer and modifying its content after the drawcalls have been sorted, say. Am I missing something?

Have you considered having immutable state objects, that could be referenced by each drawcall ?

- Stefan Reinalter on November 16, 2014 at 6:49 pm said:
  
  Hi,
  
  Yes, you’re right, but I find it easier to prevent the user from doing that with ‘alternative 2’. There are a couple of options for that, which would effectively turn them into immutable objects – once their state has been set, nobody can change it anymore.
  
  I was considering ‘real’ immutable objects mostly for render states, because they are already treated like that in the engine. I would like to go one step further though and do something similar to D3D12 pipeline state objects, but more on that in the next blog post perhaps.
  
sturm on November 17, 2014 at 4:36 pm said:

How do you stay “stateless” with shader uniforms? A camera’s view and projection matrix could be set once in a constant buffer and used by multiple draw calls in your render queue.

- Stefan Reinalter on November 17, 2014 at 9:05 pm said:
  
  General shader uniforms are copied into the buffer associated with a draw call, and only set to a constant buffer once that draw call is submitted to the rendering API. Things like a camera’s view and projection matrix are associated with a layer, as are color and depth render targets. The user would set up several layers, and specify the camera to use, the render targets to draw to, etc. for each of these layers. When submitting a draw call, one of these layers needs to be explicitly referenced. Draw calls can then be sorted by layers, and a layer’s data is referenced whenever a draw call in that layer is submitted to the rendering API.
  
rasmusrn on November 27, 2014 at 4:54 pm said:

Hi Stefan, thanks for a great blog post. Looking forward to part 3.

How do you store and/or reference data that is not relevant across all draw calls? For example, drawing a bone mesh would require pose data whereas drawing a static mesh would not.

Do you somehow write this data into the renderQueue, and if so, how? Or do you store some type/id combination for each draw call, so you can identify the object (and thereby locate any related custom data such as pose data) after the sort? I hope my question makes sense.

- Stefan Reinalter on December 1, 2014 at 2:31 pm said:
  
  Data that belongs to a draw call is stored with that draw call.
  In the case of a skinned bone mesh, the draw call needs the data in a constant buffer on the GPU, so all that data is copied when the draw call is issued to the queue, so it can be retrieved later when dispatching the draw call using the API in the render backend. I briefly talk about this in the post.
  
  For sorting the draw calls, each draw call is associated with a key. For each draw call, you would store its key, and the offset to the data in memory (that’s how it is described in the BitSquid presentation I linked). This means that each draw call has a different amount of data that needs to be stored alongside. That is also the reason why you only exchange keys and their offset into the data stream when sorting the data, and do not exchange the data itself upon each sort operation. That would be horribly inefficient.
  
  I will write about this in more detail in one of the next posts.
  
Pingback: Stateless, layered, multi-threaded rendering – Part 3: API Design Details | Molecular Musings
Pingback: Stateless, layered, multi-threaded rendering – Part 4: Memory Management & Synchronization | Molecular Musings
Jake Megaffin on April 18, 2015 at 8:28 pm said:

You could also leverage the compiler to prevent repeated submissions of the same draw call using an std::unique_ptr. CreateIndexedDrawCall() would return a unique_ptr which would have to be moved into Submit(), preventing modification and resubmission.

Matt Davies on April 20, 2017 at 5:01 pm said:

I used a slightly different, declarative approach. I have a structure that represents the entire state that the API can have (in my case OpenGL). I then have a SetState function, which works out the differences and makes the minimum calls required to make the OpenGL state equal to my State structure.

I had a function to set my State structure to predefined defaults and a function to initialise the State structure to the current state of OpenGL. I then alter the structure to my needs and call SetState. I never had state bugs again.
