Stateless, layered, multi-threaded rendering – Part 1

In this post, I would like to describe what features and performance characteristics I want from a modern rendering system: it should support stateless rendering, rendering in different layers/buckets, and rendering that can run in parallel on as many cores as are available.

I have been pondering about how to implement such a rendering system efficiently lately, and wanted to document/share my ideas and findings so far, before I go and implement the whole thing.

Rendering backend

What do I mean by a rendering backend? In my opinion, the rendering backend should be responsible for only one thing: submitting draw calls using a graphics API such as D3D or OGL. It is the responsibility of higher-level systems to make sure that only the minimum amount of draw calls are made, and that draw calls and state changes are ordered and optimized.

Stateless rendering

All graphics APIs we usually deal with are stateful things. This means that whenever you change any state in the API for subsequent draw calls, this state change also affects draw calls submitted at a later point in time. As an example, if you change the culling state from backface to frontface culling for some object, you need to either reset the state after the object finished rendering, or set a default state for all other objects, otherwise some objects end up being rendered with the wrong culling state.

Exposing such a stateful API to the user is error-prone, and a bad abstraction. Ideally, submitting a draw call with whatever state we want should not affect any of the other draw calls. This would allow us to treat each individual draw call like a single “thing” that carries all state it needs with it, not leaking any of its state into other draw calls.

This would also enable us to easily change the order of draw calls (as long as the rendering result stays the same), allowing us to get rid of redundant state changes, and sorting draw calls by a certain key (front-to-back,, back-to-front, or some other criteria).

The presentation about Firaxis’ LORE system used in Civilization V goes a bit more into detail.

Layered rendering

Also known as bucketized rendering, the idea here is to assign a key to a draw call which is then used for sorting. Typically, a key is just a single 32-bit or 64-bit integer, nothing more. Usually, a key encodes certain data like distance, material, shader, etc. of a draw call in individual bits. Depending on where those bits are stored in the integer, you can apply different sorting criteria for the same array of draw calls, as long as you know how the keys were built.

This is a very efficient and straightforward approach, because you can use e.g. a simple radix sort on the integers, and don’t have to worry about how to sort the data (is it sorted by distance? by texture? by material?). The sorting criteria is basically encoded in the bits of the integer – if you want to sort by material rather than by distance, just put the respective bits in a different place.

Christer Ericson’s blog explains the concept well, if you are not familiar with it.

There is one thing I would most likely change compared to Christer’s approach, though. I would argue that the renderer itself (e.g. a Deferred Renderer, a Clustered Renderer, a Forward+ Renderer) knows how to render the entities, and hence also knows how many and which layers it needs.

For example, a simple Deferred Renderer will first render objects to the G-Buffer, then render the decals, then render a shadow map for each shadow casting light source, apply the lighting of all light sources one by one, and finally render transparent objects using a forward pass, followed by HUD elements and similar things. Of course there are dozens of different implementations possible, but you get the idea.

My point is that I would not try to cram every draw call of every layer into a key of the same size, storing them all in one big stream (or per-thread local streams), but rather use differently sized keys for different layers. E.g. when rendering a shadow map, objects should be sorted front-to-back, and there is no need to sort by material or shader. Hence, a single 16-bit integer is probably enough for a rough front-to-back sort based on the distance to the camera. Therefore, I would put those 16-bit keys into a different “bucket” than draw calls that belong to other layers.

I might have 16-bit keys for shadow map buckets, 32-bit keys for transparent object buckets, and 64-bit keys for general G-Buffer buckets. The idea here is that smaller data can be sorted faster, and individual buckets can be sorted in parallel on different threads.

Multi-threaded rendering

Using such a layered/bucketized system, one of the big benefits we get is of course the ability to use all available threads for rendering. The common approach is to queue all draw calls into layers/buckets for a whole frame, sort them by key, and then submit them to the graphics API using the aforementioned rendering backend. API calls are only done from the main thread, but queueing the individual draw calls can easily be done in parallel.

If the engine follows a data-oriented approach, and/or uses a task scheduler, each core can work on N given entities at a time, possibly splitting its work into several tasks which are handed to the scheduler. Instead of storing all draw calls and their keys in one single stream of data, I would probably use an approach similar to the one used by the Bitsquid engine: storing draw call data in per-thread local streams, which are then sorted and merged before being submitted on the main thread.

General thoughts

Last but not least, just a thought about rendering in general: rendering of entities should happen by pushing draw calls into individual buckets, not by pulling state from each and every entity for each bucket.

More specifically, instead of doing this:

for each (bucket b)
  for each (entity e)
    submit_draw_call_to_bucket(b, get_draw_call(e, b));

You should rather to this:

for each (entity e)
  submit_draw_call_to_bucket(shadow_map, e);
  submit_draw_call_to_bucket(g_buffer, e);

This might seem trivial to some of you, but I thought it was worth pointing out. After all, there are still many people that render a std::vector of GameObjects by iterating through the vector, calling a virtual Render() method for each object. Nothing wrong per se with that, but you won’t be doing a game with hundreds of hundreds of entities with such an approach.

Being able to push draw calls into different buckets ensures that we only touch each entity once per frame instead of several times, which can make a difference in performance if you have lots of entities.

That’s it for today! I’m hoping that there will be more things to share and maybe even some code to look at by Thursday next week!

22 thoughts on “Stateless, layered, multi-threaded rendering – Part 1

  1. Hi Stefan,

    There are multiple sources online about stateless rendering as you explain it (there are some links in your very post) and they always talk about the draw call submission part.

    I’m curious about the creation/destruction/update part of the problem, though.

    Usually, in a frame, the engine gathers the user inputs and performs the modifications to game/entities state at the beginning. Then, this changes get propagated to the rendering subsystem to be submitted to the GPU via the methods you talk about in the post. The rendering part can be dispatched to a rendering thread, that owns the API device and processes all the command queues while the main/worker threads start working on the next frame.

    While the rendering is taking place, the worker threads might make updates to graphics entities for the next frame, e.g. update the skinning matrices in the constant buffer of an animated character.

    How is this scheduled? Is it submitted to a pending work queue so all the GPU interactions are done once the render thread has finished rendering the previous frame? I’m assuming that they need to be handled differently from the draw calls.

    The same happens with creation and destruction of resources. While the rendering system is rendering a loading screen, it may need to create buffers, shaders, GPU states for the new level, destroy old objects from the old level and map buffers to be written to with the new loaded data.

    How is that combined with the render thread that is waiting for draw call queues to be rendered? Are the resource management requests fulfilled before a frame rendered is dispatched?

    I’m sorry if there are some dumb questions. I’ve never seen an implementation like this in place and have a lot of doubts when doing my own.

    Thank you for your insight on the problem! I’m looking forwart to the next post!

    • Hey Marc,

      Don’t worry, those are not dumb questions!
      To make things a little bit easier to think through, forget about the rendering thread for a second, and concentrate on the following per-frame loop:

      • The CPU kicks of the command buffer/queue for the GPU to work on, so the GPU now renders frame N.
      • While the GPU is busy rendering frame N, the CPU updates the world state/simulation for frame N+1.

      CPU and GPU are happily churning through their tasks in parallel, but as you noticed, this could create problems when the CPU tries to alter a resource which is used by the GPU.

      When using an approach as described in the post, there are basically 3 different methods that deal with the problem of updating resources while they might still be in use by the rendering system. Those are the following:

      1. Double-buffer all dynamic resources. This is easy to implement, and really simple to understand. E.g. if you need to update a vertex buffer dynamically, you create two buffers internally, and swap between them each frame. Whenever the GPU is busy rendering from buffer A, the CPU will update buffer B, and vice versa. This approach doesn’t need any fences or other kinds of synchronization, just more memory.
      2. Copy all the CPU data for the respective draw call, and update the GPU resource before rendering. What I mean by this is that instead of updating a GPU resource directly using e.g. Map/Unmap when creating the command queue, you would memcpy() the data to a memory block on the heap, and use that for updating the GPU resource when that command is encountered in the queue. Usually, one uses a per-frame, temporary linear allocator for this kind of data. This essentially defers the resource update to the point where the GPU queue is emptied and submitted to the API. Of course this needs extra memory and more copy operations, but the nice thing is that you can just hand of your dynamic data at one point and forget about it, without having to synchronize.
      3. The third option is a variant of the second option, and tries to get rid of the extra memory and copy operations needed. All it does is store a pointer to the dynamic data instead of copying it somewhere, which means that you need to make sure that the pointer still holds the correct data whenever the GPU queue is submitted to the API. This is more error-prone and shifts the burden to the programmer using it, but can save memory and performance for certain kinds of operations.

      In the end, you probably want to support all three options in your engine, because what option you need really depends on how a resource is being used. Last but not least, there are of course variations of the above, like using a shared ring-buffer for resources that are updated with approx. the same frequency. Check out Transient Buffers.

      • Hi Stefan,

        Thank you for your detailed reply!

        For updating GPU data, it basically boils down to either copy the data to a double/triple/ring GPU buffer permanently mapped or find a way to defer the data uploading until last frame is rendered, so access to GPU resources is allowed again.

        Having this timed structure makes a lot of sense! I see it clearer, now!

        Thanks for the link to the Transient Buffers slides! Really interesting!

  2. Pingback: More AI work | Spellcaster Studios

  3. Hi,
    really high-quality content and interesting blog.

    For now I have a prototyped renderer which iterates over scene/renderables and just render them using brute-force (no sorting and culling :/) – I really hate it so I tried to implement a render queue as you described but I went into some problems – so dumb questions follows:
    Let’s suppose that we are using multiple render queues: opaque, decals, transparent, particles, shadowmap, lights. The problem (at least for me) starts to be visible when I want to have a data-driven renderer, how many queues to create? They are fixed (hard-coded in the scene traversal routine) or also data-driven? How to connect render queues with different “phases” like shadow mapping, particles and deferred rendering? Maybe I should stay with the “classy” render queue layout (opaque, transparent)…

    I know that its a shame to ask so fundamental questions, but I just try to understand the concept behind by experimenting with it.

      • Thank you for your reply!

        I actualy saw those presentation, read their blog, and that’s how my renderer looks like, but it’s unclear to me – probably I should try to ask Niklas – but I don’t know how to connect scene=>render queues=>layers/resource generators. If I have a data-driven renderer (hence data-driven render queues, right?) the scene does not know into which layer put stuff in… For example I filled the RQ like you said:
        submit_draw_call_to_bucket(shadow_map, e);
        submit_draw_call_to_bucket(g_buffer, e);

        How does a data-driver renderer know which RQ should be used for given pass/phase/layer/resource generator. Are render queues hardcoded? Do you know any good resource where I can learn about them?


        PS: For now I think that I should create hashmap with “named” renderqueues and reference them from render_config (but I don’t think this is good approach, or data-oriented, since scene will not know how to populate them). Still hunting for good approach…

      • Well, if you look at Bitsquid’s data-driven renderer, here’s what I understood:

        • Each layer specifies the render targets to use, and which kind of depth-sorting.
        • The shader system can reference layers by name.
        • The shader system points out what layer to render into.

        For the queues, I think that you pretty much have two options:

        • Give each layer its own render queue. This is what I would do.
        • Store a few bits corresponding to the layer in each key, and use one only queue. From what I’ve seen regarding the key layout used in Bitsquid, I would suspect that this is what they use.

        Furthermore, the shader system is the one that knows which shaders render into which layers. I would suspect that this is something that is either set up in the editor per shader, or per mesh/model. All you have to do is store a few bits for each model, so that you know into which queue you need to render each model.

        Hope that helps, and maybe Niklas can provide you with more details!

  4. Pingback: Stateless, layered, multi-threaded rendering – Part 3: API Design Details | Molecular Musings

  5. Something doesn’t click for me, so I ask here:

    Let’s say I have one shader which require 5 uniforms(material properites) and 10 objects with different material properties. I sort these objects and after that I’m going to draw the objects, but because I have different materials I have to change those uniforms for every object…so where is the performance in this case? Changing uniform values doesn’t affect perfromance?

    Second scenario: I have two shaders and I sort my 10 objects. So I have my objects sorted and now I’m going to draw them. Let’s say first 5 objects that require shader 1 and other 5 objects require shader 2.
    My question is how would I know that after 5 objects I have to bind other shader?
    | 00 01 02 03 04 15 16 17 18 19 | -> my sorted array. 0 is first shader 1 is second shader
    | -> change shader

    In the middle of the loop I have to bind the second shader which seems hard.

    I am super confused right now, maybe you can help me 🙂

    • Changing uniform values does affect performance, because it leads to more draw calls. Sorting won’t help with that, but instanced rendering would. With sorting you can at least detect when the same objects use different materials, and use instanced rendering in those cases. I have seen engines that do this automatically, so it’s definitely possible.

      Regarding your second question, the easiest thing to do is trying to apply the current shader in each iteration of the loop, and have a caching mechanism in the backend which will only bind a shader if it’s different from the last one. With sorted rendering, this greatly reduces the number of API calls in the backend.

  6. Hi Stefan,
    very interesting blog and full of useful ideas,
    I’m trying to develop a rendering engine using an entity component system, but while reading your articles several questions/doubts come to mind… I’m not sure if this post is the right place for my questions, if not, my apologies… here my dumb questions:

    it’s a better choice to make the systems own its components or store them elsewhere? like a component manager?

    If a system needs to match two or more components, for example the renderingSys needs the meshCmp and the transformCmp, who is in charge to match the correct transform with the correct mesh? all system have its own copy of the desired components?

    when using multi threading, two system needs to be synchronised? e.g. the first waiting for the result of the second?

    best wishes!

    • Hi Daniele,

      it’s a better choice to make the systems own its components or store them elsewhere? like a component manager?

      In my design, all the components are stored in the systems they belong to. For example, the LightingSystem stores components for point lights and directional lights in a SOA fashion. You could also have separate managers that keep track of components, which are then used by systems.
      I don’t think either one of these approaches is superior to the other, it really depends on how everything else like entities and systems are structured.

      If a system needs to match two or more components, for example the renderingSys needs the meshCmp and the transformCmp, who is in charge to match the correct transform with the correct mesh? all system have its own copy of the desired components?

      Same as above, it depends on many factors. You could have different component managers that keep track of which components are part of which entity, similar to what Niklas is doing in the Bitsquid Engine.

      when using multi threading, two system needs to be synchronised? e.g. the first waiting for the result of the second?

      Not necessarily. The best data is one that isn’t shared – and if data isn’t shared, there’s no need to synchronize. For example, one common solution to avoid the problem of shared data and synchronization is just to double-buffer the data. Granted, you can’t use that everywhere, but often it’s a simple solution.

      Furthermore, nowadays most engines split the Update & Render portion of a frame into lots of (mostly) independent tasks using something called a task scheduler. The task scheduler can automatically take care of parent/child relationships, can split large tasks into several small ones, do load-balancing, etc. The task scheduler is then mostly responsible for synchronization among tasks, and gives users easy-to-understand tools to do so.

      • Thanks Stefan for the reply, I’ll read more about the task scheduler, seems interesting ; )

      • I’ve read some article (including yours) about task scheduler, it’s a really interesting subject and I’m going to give it a try, but more questions come to mind:

        – Are the system submitting new task? e.g. every “tick” of the running loop each system submits its task?
        – Is there any task used for frame synchronisation? e.g. end-frame task with all update tasks as dependencies? Are they useful?
        – What about the rendering part? Is the rendering split in tasks and submitted to the scheduler or is done via a “serial” manner? e.g. for N mesh there will be N rendering tasks run asynchronously?
        – I imagine the back-end render (who invokes openGL/directX calls) runs without using the scheduler, e.g. the front-end use tasks to send commands to back-end, the back-end wait until all per-frame drawing calls are submitted and then render them in a “normal manner”, I’m wrong?

        Thanks again, especially for the valuable help about those topics/techniques.

      • – Are the system submitting new task? e.g. every “tick” of the running loop each system submits its task?

        Yes, each system would submit its own tasks. The update-loop could basically consist of a root task, that has several child tasks, which in turn would be the root tasks of each system. Those tasks could spawn new tasks, which spawn new tasks, and so on. The systems/tasks could evenly distribute their work across all available worker threads. This can automatically be done by using work-stealing.

        – Is there any task used for frame synchronisation? e.g. end-frame task with all update tasks as dependencies? Are they useful?

        Yes, I would make a single root task which is the per-frame synchronization point. As soon as this root task (and all its children) are finished, you can kick off the rendering in the back-end.

        – What about the rendering part? Is the rendering split in tasks and submitted to the scheduler or is done via a “serial” manner? e.g. for N mesh there will be N rendering tasks run asynchronously?

        Adding render commands to layers/buckets/queues (see my latest posts on this subject) is also done using the scheduler, because the scheduler can ensure to use all available cores as efficient as possible.

        – I imagine the back-end render (who invokes openGL/directX calls) runs without using the scheduler, e.g. the front-end use tasks to send commands to back-end, the back-end wait until all per-frame drawing calls are submitted and then render them in a “normal manner”, I’m wrong?

        Yes, exactly. The process of submitting the actual draw calls to the graphics API is done in a serial manner. Nevertheless, this can (and should) be done from a separate thread. This means that while one thread is kicking off the draw calls to the graphics API, the main thread (and all available worker threads) can start working on the next frame in parallel. Most often a semaphore is used to signal to the render thread that all draw calls are now ready to be submitted. Similarly, you also need a semaphore in the other direction which tells the main thread that it should start working on the next frame. You need this semaphore to ensure that the CPU doesn’t get too far ahead of the GPU.

        See Jeff Preshing’s CppCon2014 talk if you’re interested in those subjects!

      • Thanks again Stefan, your posts are really helpful, I’m in the middle of developing a new engine and sadly seems that nobody (wants to) share its knowledge, I’m happy to see an exception!
        keep going 😉

  7. Pingback: Stateless, layered, multi-threaded rendering – Part 4: Memory Management & Synchronization | Molecular Musings

  8. I’m curious. How does a sorted key render-queue system handle something like hardware geometry instancing?

    If there is any depth sorting at all it seems to me that objects with the same material/mesh keys won’t be next to each other in the queue. Depth sorting with instanced objects almost seems pointless with several instances because they are not guaranteed to be contiguously depth sorted with the rest of the scene. Depth sorting might have value amidst instances of the same type of object, though. Otherwise if an object-type ID is included in the hash (to make the geometry instances contiguous in the sorting) it would take up precious key space and have a hard-limit on the number of unique instance-able objects in the scene.

    I enjoy the elegance of the hash-key sorting approach, but I’m not sure how to tackle this instancing problem without annoying special-treatment cases or wasteful usage of the hash key.

    • That’s a tough question.

      Depending on what kind of scene you render, and what rendering technique you use (deferred, forward+, …), it might be beneficial to do a Z-prepass and completely forget about depth sorting for opaque objects if you have lots of instanced objects. On the other hand, you don’t have to do strict depth sorting for opaque objects anyway. If you build the key accordingly, you can have a sort order that minimizes material/shader/texture changes, and still does rough front-to-back sorting for objects that share the same material. It’s not 100% accurate sorting, but better than nothing in this case – and you basically get it for free.

      For transparent objects, that’s a different story though. But the question is, how many instanced, transparent objects does your scene have? Is it crucial to have perfect back-to-front sorting, or can you get away with the occasional unsorted object because the viewer won’t notice anyway?

      All in all, I don’t think there is a silver bullet for this.

  9. Pingback: New Sideproject, C# Engine? | wumpfblog

Leave a Reply to Stefan Reinalter Cancel reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.