Continuing where we left off last time, today I want to present a few ideas about how to design the API that enables us to do stateless rendering.
Before we talk about how to design a stateless rendering API, let us quickly take a look at how a stateful rendering API usually performs its task.
Conventional, stateful rendering
This is the type of rendering everybody knows: you set a few states here and there, submit a draw call, set some states, submit a draw call, and so on.
Usually, this looks something like the following:
    // 1) render first object
    backend::SetCullState(CULLSTATE_BACK);
    backend::SetVertexBuffer(vb);
    backend::SetIndexBuffer(ib);
    backend::BindTexture(0u, diffuse);
    backend::DrawIndexed(triCount*3, 0u, 0u);

    // 2) render second object
    backend::SetCullState(CULLSTATE_FRONT);
    backend::BindTexture(0u, otherDiffuse);
    backend::SetAlphaBlendState(ONE_ONE);
    backend::DrawIndexed(triCount*3, 0u, 0u);
The problem with that abstraction is that whatever state was set when rendering the first object also affects the rendering of the second object, which affects the rendering of the third object, and so on. State once set in the pipeline leaks into subsequent draw calls, and in case it isn’t obvious, there are actually two problems (not just one!) with the stateful abstraction above:
- First problem: When rendering e.g. a third object, it will be rendered with a reversed culling state in case we forget to set it back to CULLSTATE_BACK. Same with alpha blending. This is the smaller of the two problems.
- Second problem: Whenever you add or change a state for one draw call, all following draw calls that do not set that state themselves are now broken. This is much worse than the first problem, because you either have to change all the draw calls you didn’t want to touch in the first place, or always reset every state you touched to its default value after a draw call has been submitted. This is both error-prone and tedious.
And we haven’t even started talking about multi-threaded rendering yet.
To elaborate a bit on the second point, imagine what would happen if we changed the code above to the following:
    // 1) render first object
    backend::SetCullState(CULLSTATE_BACK);
    backend::SetVertexBuffer(vb);
    backend::SetIndexBuffer(ib);
    backend::SetRasterizerState(NO_DEPTH_WRITE);    // <===
    backend::BindTexture(0u, diffuse);
    backend::DrawIndexed(triCount*3, 0u, 0u);

    // 2) render second object
    backend::SetCullState(CULLSTATE_FRONT);
    backend::BindTexture(0u, otherDiffuse);
    backend::SetAlphaBlendState(ONE_ONE);
    backend::DrawIndexed(triCount*3, 0u, 0u);
By introducing a new command SetRasterizerState that changes the state of the pipeline, all draw calls following the first one are also affected by our change, because the other draw calls never touch that state. We either have to set it explicitly in the second draw call, or reset it after submitting the first DrawIndexed. It’s much worse when you want to move certain render operations from here to there, put them in a function, etc. because you always have to be aware of the “surrounding state”. Like I said, error-prone and tedious.
Introducing a stateless API
Armed with the knowledge of what’s clearly wrong with the stateful approach above, let us try to come up with better solutions. One possible solution would be to start from a clean default state each frame, and reset all the states back to their default whenever we submit a draw call. If the user were to do that themselves, this could look like the following:
    // at the beginning of a frame, all states are set to their default value

    // 1) render first object
    backend::SetCullState(CULLSTATE_BACK);
    backend::SetVertexBuffer(vb);
    backend::SetIndexBuffer(ib);
    backend::BindTexture(0u, diffuse);
    backend::DrawIndexed(triCount*3, 0u, 0u);
    backend::ResetDefault();    // <===

    // 2) render second object
    backend::SetCullState(CULLSTATE_FRONT);
    backend::BindTexture(0u, otherDiffuse);
    backend::SetAlphaBlendState(ONE_ONE);
    backend::DrawIndexed(triCount*3, 0u, 0u);
    backend::ResetDefault();    // <===
Of course, we could also put that functionality into our API, and let it take care of that.
For now, let us assume that we have one big render queue which is used for queueing up all draw calls during a frame, which then get sorted and dispatched using the render backend at the end of a frame. Then we could do the following:
    // 1) render first object
    renderQueue::SetCullState(CULLSTATE_BACK);
    renderQueue::SetVertexBuffer(vb);
    renderQueue::SetIndexBuffer(ib);
    renderQueue::BindTexture(0u, diffuse);
    renderQueue::SubmitIndexed(triCount*3, 0u, 0u);

    // 2) render second object
    renderQueue::SetCullState(CULLSTATE_FRONT);
    renderQueue::BindTexture(0u, otherDiffuse);
    renderQueue::SetAlphaBlendState(ONE_ONE);
    renderQueue::SubmitIndexed(triCount*3, 0u, 0u);

    // at the end of a frame:
    renderQueue::Sort();
    renderQueue::Flush();
Basically, all our renderQueue implementation has to do is the following:
- Keep track of the currently set vertex buffer, index buffer, cull state, alpha state, texture samplers, etc. Whenever someone calls renderQueue::Set*State(), simply change the corresponding member to the new state.
- For each Submit*() call, insert a new draw call into the queue. Our queue in this case would be raw memory, and we would simply store the type of the operation (an indexed draw call), the key (used for sorting), and all data that goes along with the draw call (in our case all the current states). After that, we reset all our internal state members to their default value.
- Upon a call to Sort(), we simply sort all the keys using e.g. a radix sort.
- Upon a call to Flush(), we walk the sorted array of operations, fetch the type, fetch the data, and call the respective render backend functions. It’s very similar to implementing a simple virtual machine.
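To make the four steps above a bit more concrete, here is a minimal, single-threaded sketch. Please treat it as an illustration only: DrawState, Command, the explicit key parameter of SubmitIndexed() and the string log that stands in for the render backend are all assumptions of mine, a std::vector replaces the raw memory, and std::stable_sort replaces the radix sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum CullState { CULLSTATE_BACK, CULLSTATE_FRONT };

// Hypothetical state block; a real one would also hold textures, blend state, etc.
struct DrawState {
    CullState cull = CULLSTATE_BACK;  // default value
    int vertexBuffer = 0;
    int indexBuffer = 0;
};

// One queued operation: the sort key plus a snapshot of all state it needs.
struct Command {
    std::uint64_t key;
    DrawState state;
    std::uint32_t indexCount;
};

class RenderQueue {
public:
    // Set*() only changes the tracked "current" state.
    void SetCullState(CullState c) { current_.cull = c; }
    void SetVertexBuffer(int vb)   { current_.vertexBuffer = vb; }
    void SetIndexBuffer(int ib)    { current_.indexBuffer = ib; }

    // Store a snapshot of the current state, then reset it to the defaults
    // so that nothing leaks into the next draw call.
    void SubmitIndexed(std::uint64_t key, std::uint32_t indexCount) {
        assert(indexCount > 0);  // an empty draw call is most likely a bug
        commands_.push_back(Command{key, current_, indexCount});
        current_ = DrawState{};
    }

    // Stand-in for a radix sort over the keys.
    void Sort() {
        std::stable_sort(commands_.begin(), commands_.end(),
                         [](const Command& a, const Command& b) { return a.key < b.key; });
    }

    // Walk the sorted commands and "dispatch" them. The backend is replaced
    // by a log of strings so the sketch stays self-contained.
    std::vector<std::string> Flush() {
        std::vector<std::string> log;
        for (const Command& c : commands_) {
            log.push_back("draw key=" + std::to_string(c.key) +
                          " cull=" + (c.state.cull == CULLSTATE_BACK ? "back" : "front"));
        }
        commands_.clear();
        return log;
    }

private:
    DrawState current_;
    std::vector<Command> commands_;
};
```

Submitting a draw call with CULLSTATE_FRONT and then a second one without touching the cull state yields a second command that uses CULLSTATE_BACK again, which is exactly the property we are after.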
Of course, there are many implementation details we haven’t talked about yet, but that’s basically the gist of it. However, there is one thing I really don’t like about that approach, as soon as multi-threaded rendering enters the picture.
With multi-threaded rendering, we want to be able to call any renderQueue function from any thread, which means that even though the C++ code looks like sequential code, calls to various renderQueue::Set*() functions are made from different threads, and are therefore interleaved. We can no longer use simple members in our renderQueue implementation to keep track of the current state, and not even wrapping each function with a mutex (or similar) would work, because we would need to wrap all operations that belong to a single draw call at once. This has way too much overhead; don’t even think about doing it this way.
There is of course a simpler, and faster solution to that: thread-local storage. Instead of keeping track of the currently set state using simple members in the renderQueue, each thread keeps track of its state using e.g. a thread-local struct which holds all the states.
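As a rough sketch of what that could look like, the following uses the C++11 thread_local keyword. The DrawState struct, kMaxTextureSlots and TakeSnapshot() are hypothetical names I made up for illustration; the only point is that every thread reads and writes its own copy of the state.

```cpp
#include <cassert>
#include <cstddef>

enum CullState { CULLSTATE_BACK, CULLSTATE_FRONT };

constexpr std::size_t kMaxTextureSlots = 4;  // assumption, purely illustrative

// Hypothetical state block; a real one would hold many more states.
struct DrawState {
    CullState cull = CULLSTATE_BACK;
    int textures[kMaxTextureSlots] = {};  // 0 means "nothing bound"
};

namespace renderQueue {
    // Each thread that touches t_state gets its own, default-initialized copy,
    // so Set*() calls made on different threads can no longer interleave.
    thread_local DrawState t_state;

    void SetCullState(CullState c) { t_state.cull = c; }

    void BindTexture(std::size_t slot, int texture) {
        assert(slot < kMaxTextureSlots);
        t_state.textures[slot] = texture;
    }

    // What a Submit*() call would do with the state: take a snapshot of the
    // calling thread's state and reset that state to its defaults.
    DrawState TakeSnapshot() {
        DrawState snapshot = t_state;
        t_state = DrawState{};
        return snapshot;
    }
}
```

No locking is needed for the state itself, because no other thread can ever see t_state of the calling thread. Only the insertion into the queue still has to be made thread-safe.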
However, I’m still not satisfied with such an approach, because it means that every renderQueue function call now has to access some thread-local variable, which adds overhead compared to just accessing memory. Therefore, I am also considering the following alternatives.
The first one boils down to creating structs holding the state for a draw call on the stack, and then copying all of it into the queue upon submitting a draw call, something like the following:
    IndexedDrawCall dc;
    dc.SetVertexBuffer(mesh->vertexBuffer);
    dc.SetIndexBuffer(mesh->indexBuffer);
    dc.SetCullState(CULLSTATE_BACK);
    renderQueue::Submit(dc);
Firstly, this pretty much gets rid of all the multi-threading problems we have seen in the approach above. If we want to use one global queue, all we have to do is copy the data given in a call to renderQueue::Submit() (along with the key for sorting). For that, we can simply use a linear allocator that does nothing more than increment a pointer for each allocation. By using atomic operations, we can trivially make the allocation both thread-safe, and fast. If we don’t want to use atomic operations, we can use a thread-local queue per thread instead.
Secondly, this would allow us to cache certain draw calls. For certain static parts of the world, we could build the draw call once, store it somewhere, and submit it into the renderQueue without any additional work.
Thirdly, each draw call like IndexedDrawCall, InstancedDrawCall, ComputeDrawCall, etc. could make sure to only store the data it needs, which could cut down on the amount of memory required to store the individual draw calls.
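The linear allocator mentioned in the first point could be sketched roughly as follows. This is only one possible way to do it, assuming a fixed-size block of memory that is handed in from the outside; the class name, the kAlignment constant and the rounding of every allocation to 16 bytes are my own choices for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical bump allocator over a fixed block of memory.
class LinearAllocator {
public:
    static constexpr std::size_t kAlignment = 16;  // assumption: enough for draw call data

    // `memory` must be aligned to kAlignment and must outlive the allocator.
    LinearAllocator(void* memory, std::size_t size)
        : begin_(static_cast<char*>(memory)), size_(size), offset_(0) {}

    // Reserves `bytes` bytes and returns nullptr once the block is exhausted.
    // fetch_add hands every caller a distinct range, so no lock is needed.
    void* Allocate(std::size_t bytes) {
        assert(bytes > 0);
        // Round up so that every returned pointer stays aligned.
        const std::size_t rounded = (bytes + kAlignment - 1) & ~(kAlignment - 1);
        const std::size_t start = offset_.fetch_add(rounded, std::memory_order_relaxed);
        if (start + rounded > size_) {
            return nullptr;  // note: the offset is not rolled back in this sketch
        }
        return begin_ + start;
    }

    // Called once per frame, after the queue has been flushed and while no
    // other thread is allocating.
    void Reset() { offset_.store(0, std::memory_order_relaxed); }

private:
    char* begin_;
    std::size_t size_;
    std::atomic<std::size_t> offset_;
};
```

renderQueue::Submit() would then call Allocate(sizeof(dc)) and copy the draw call into the returned memory. The atomic increment only makes the reservation thread-safe; making the copied data visible to the thread that later sorts and flushes the queue still needs some form of synchronization at the end of the frame.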
There are two things which I don’t like with this approach, though:
- Each instance of a draw call struct is stateful again, which means that the user could create a draw call on the stack, submit it once, change its state, and submit it again. Of course that’s up to the user and not recommended, but in that regard we are back to square one, so to say.
- We are accessing memory much more often than we need to, because we first change the state of the struct on the stack, and then copy all its data to some other place in memory depending on where renderQueue::Submit() copies the data to.
Which brings me to my last and currently preferred alternative:
Instead of creating draw call structs on the stack, you have to ask the renderQueue to hand one to you:
    IndexedDrawCall* dc = renderQueue::CreateIndexedDrawCall();
    dc->SetVertexBuffer(mesh->vertexBuffer);
    dc->SetIndexBuffer(mesh->indexBuffer);
    dc->SetCullState(CULLSTATE_BACK);
    renderQueue::Submit(dc);
It doesn’t look like much of a difference, but there are a few things we can do here:
- When creating a new draw call (e.g. using CreateIndexedDrawCall()), we again have the option of using a global queue and atomic operations for allocating memory, or of using thread-local queues. I would prefer the latter (more on that in the next post), but the point is that “creating” such a draw call essentially just increments a pointer internally, handing the user the final destination of all the draw call’s data. This means that we no longer manipulate a struct on the stack and copy it afterwards, but directly write into memory. A call to Submit() then only has to store the key, and a pointer to where the data is stored.
- Because we are in control of how draw calls are created, we can easily make sure that the user cannot submit a draw call twice. We could do that by e.g. checking the pointer given as an argument to renderQueue::Submit(): if its address is less than or equal to that of the last submitted draw call, the user tried to submit the same draw call twice, which is invalid because it implies stateful usage of a draw call struct. Note that such a check assumes that draw calls are submitted in the same order in which they were created on that queue.
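Both points can be sketched in a few lines. The following is a single-threaded illustration (think of it as one thread-local queue) and not a final design: the fixed capacity, the explicit key parameter, the bool return value of Submit() and SubmittedCount() are assumptions I added to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical draw call data; integer handles stand in for real resources.
struct IndexedDrawCall {
    int vertexBuffer = 0;
    int indexBuffer = 0;
    int cullState = 0;

    void SetVertexBuffer(int vb) { vertexBuffer = vb; }
    void SetIndexBuffer(int ib)  { indexBuffer = ib; }
    void SetCullState(int state) { cullState = state; }
};

class RenderQueue {
public:
    // The storage is sized once and never grows, so handed-out pointers stay valid.
    explicit RenderQueue(std::size_t capacity) : storage_(capacity) {}

    // "Creating" a draw call just bumps an index and hands out the final
    // destination of the data; nothing is copied later on.
    IndexedDrawCall* CreateIndexedDrawCall() {
        if (used_ == storage_.size()) {
            return nullptr;  // out of space for this frame
        }
        return &storage_[used_++];
    }

    // Stores only the key and a pointer to the data. Returns false if the
    // address is not beyond the last submitted draw call, i.e. if the same
    // draw call is submitted twice or draw calls are submitted out of order.
    bool Submit(std::uint64_t key, const IndexedDrawCall* dc) {
        assert(dc != nullptr);
        if (lastSubmitted_ != nullptr && dc <= lastSubmitted_) {
            return false;
        }
        entries_.push_back(Entry{key, dc});
        lastSubmitted_ = dc;
        return true;
    }

    std::size_t SubmittedCount() const { return entries_.size(); }

private:
    struct Entry {
        std::uint64_t key;
        const IndexedDrawCall* data;
    };

    std::vector<IndexedDrawCall> storage_;
    std::size_t used_ = 0;
    const IndexedDrawCall* lastSubmitted_ = nullptr;
    std::vector<Entry> entries_;
};
```

Whether a rejected submission should return false, assert, or log an error is a matter of taste; I only chose the return value here because it is the easiest to demonstrate.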
As can be seen, there are a few alternatives to how we can implement a stateless API. I think it is important to keep in mind things like multi-threaded rendering and how memory allocations for draw call data are handled when designing such an API.
Note that I only briefly touched the subject of multi-threaded rendering. There are many more things to consider like false sharing, how allocations are made, and when and how data is written to memory. I think about those things when designing such an API, but didn’t have the time (yet) to write down all my thoughts and ideas – the post is already quite long as it is.
Further note that we also haven’t talked about how to generate keys for sorting the data yet, and how we try to “group” draw calls by individual layers, introduced in the first post. This will be the topic of the next post!
I have not implemented any of the above yet, so please take this with a grain of salt. It is surely not a final design, because these things usually take a few iterations until you come up with something that you are truly satisfied with.
Let me know about any oversights or faults I made, and feel free to discuss other, better alternatives I might have missed in the comments!