With multicore hardware becoming the norm in PC and console gaming as well as on mobile platforms, it is crucial to take advantage of every processor core and hardware thread available to us developers. It is therefore important to build technology that alleviates the task of writing correct, multi-threaded code.
To that end, this series explains how to build a load-balancing, work-stealing task scheduler.
In comparison to manually threading your code, automatic task scheduling has certain benefits:
- Better scalability: automatic load-balancing makes it easier to take advantage of N cores/threads.
- Less error-prone: dependencies can be expressed as simple parent-child relationships between tasks.
- Easier to gain benefits from both function and data parallelism.
In my opinion, the biggest benefit you can get from a task scheduler is increased parallelism of common per-frame tasks (e.g. animation update, collision detection, rendering). On the other hand, perpetual tasks that take several frames to complete, such as streaming geometry or audio, are better handled by hand.
The simplest scheduler
Before we start dealing with more advanced techniques like load-balancing and work-stealing, let us try to build a really simple scheduler which does nothing more than delegating independent tasks to multiple cores.
The basic idea of a task scheduler is simple:
- Create a worker thread for each available core or hardware thread in the system. Each worker thread helps in executing tasks handed to the scheduler.
- Put tasks into a queue, with each worker thread pulling tasks to be worked on from this queue.
On fixed hardware (e.g. consoles), you can make sure to create exactly as many threads as needed, avoiding any over- or under-subscription of threads in the system. Additionally, you can assign fixed hardware threads to each worker thread, which makes performance somewhat more predictable. On the PS3, each available SPU can be considered a “worker thread”.
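As a small, Windows-specific illustration of assigning a worker thread to a fixed hardware thread (PinThreadToCore is a made-up helper; consoles and other platforms offer their own affinity APIs):

#include <windows.h>

// pins a worker thread to one logical core; each bit in the affinity
// mask corresponds to one logical processor
void PinThreadToCore(HANDLE thread, unsigned int core)
{
  SetThreadAffinityMask(thread, DWORD_PTR(1) << core);
}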
On other platforms like the PC, in practice there's not much difference between creating num_cores or num_cores - 1 threads, because the OS's context switches cannot be predicted or controlled anyway.
In pseudocode, the scheduler implementation could look like the following:
// at startup:
CreateWorkerThreads(numAvailableHardwareThreads);

for each (WorkerThread)
{
  WorkOnTasks();
}

// submit work during a frame:
for each 1000 particles
{
  AddTaskToScheduler(particleData + N*1000, 1000);
}

// ...
// do other work in between, and finally synchronize
WaitUntilTasksAreFinished();
Not concerning ourselves with advanced features at the moment, AddTaskToScheduler() simply adds a task to a global queue, and could be implemented as follows:
AddTaskToScheduler(task)
{
  LockSynchronizationPrimitive();
  globalTaskQueue.Add(task);
  UnlockSynchronizationPrimitive();
}
What's left is the WorkOnTasks() function, which is run by each of the worker threads:
while (threadShouldRun)
{
  WaitUntilTaskIsAvailable();

  LockSynchronizationPrimitive();
  task = globalTaskQueue.GetAndRemove();
  UnlockSynchronizationPrimitive();

  Execute(task);
}
I specifically left out C++ implementation details for the time being. One thing to note in the current implementation is that tasks are added to and removed from a global queue, which of course needs proper synchronization across multiple threads. Regardless of the underlying synchronization primitive actually being used (e.g. mutex, critical section), these synchronization spots will cause lots of thread contention. The more tasks and threads we have, the more contention this will cause; this is where work-stealing comes into play, which will be discussed in a future part of the series.
One last thing still missing is the implementation of WaitUntilTasksAreFinished(): The simplest solution in terms of implementation complexity is to simply wait until there are no more items in the queue. This, however, needs another lock to be taken when accessing the global queue, possibly creating even more contention. A better solution is to use an atomic variable which denotes the number of tasks still available.
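As a rough sketch of that idea (the counter name openTasks is made up here, and a C++11 atomic stands in for whatever primitive the platform provides):

#include <atomic>
#include <thread>

// hypothetical counter tracking tasks that have been added but not yet finished
std::atomic<int> openTasks(0);

// AddTaskToScheduler(): increment openTasks before pushing the task
// worker thread:        decrement openTasks after Execute(task) returns

void WaitUntilTasksAreFinished()
{
  // reads only the atomic counter, never taking the queue's lock
  while (openTasks.load() > 0)
  {
    std::this_thread::yield();
  }
}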
Implementation details
In case you want to go and implement a simple scheduler yourself, here are a few starting points; a minimal C++ sketch tying them together follows the list:
- Use a simple synchronization primitive like a critical section for protecting the queue from concurrent access. Using a spin-lock or busy-looping on a volatile variable can (and will) cause problems on other platforms. Performance benefits are gained by reducing the causes of contention in the first place, not by making individual lock operations marginally faster. Removing sources of contention will be dealt with in future parts of the series.
- Use a condition variable when waiting for tasks to become available. Worker threads wait on the condition variable to become signaled, while AddTaskToScheduler() signals the condition as soon as a task has been added. This has the added benefit that worker threads don't consume CPU resources while no task is in the queue.
- Atomic variables can be implemented using the Interlocked* functions on Windows. Other platforms/OSs should provide similar functions; if not, you can always make use of the hardware instructions using inline assembler code (the LOCK prefix on x86).
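To tie these points together, here is a minimal C++ sketch of the scheduler described above, assuming C++11 standard-library primitives (std::mutex standing in for the critical section, std::condition_variable for waiting, std::atomic for the task counter). All names such as SimpleScheduler are made up for illustration; treat this as a sketch, not production code:

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Task = std::function<void()>;

class SimpleScheduler
{
public:
  explicit SimpleScheduler(unsigned int threadCount)
  {
    // one worker per available hardware thread
    for (unsigned int i = 0; i < threadCount; ++i)
      workers.emplace_back([this] { WorkOnTasks(); });
  }

  ~SimpleScheduler()
  {
    {
      std::lock_guard<std::mutex> lock(queueMutex);
      running = false;
    }
    taskAvailable.notify_all();
    for (std::thread& worker : workers)
      worker.join();
  }

  void AddTaskToScheduler(Task task)
  {
    ++openTasks;                  // count the task before it becomes visible
    {
      std::lock_guard<std::mutex> lock(queueMutex);
      globalTaskQueue.push(std::move(task));
    }
    taskAvailable.notify_one();   // wake one sleeping worker
  }

  void WaitUntilTasksAreFinished()
  {
    // busy-waits on the atomic counter; see the deficiencies below
    while (openTasks.load() > 0)
      std::this_thread::yield();
  }

private:
  void WorkOnTasks()
  {
    for (;;)
    {
      Task task;
      {
        std::unique_lock<std::mutex> lock(queueMutex);
        // sleep until a task is available or the scheduler shuts down
        taskAvailable.wait(lock, [this] { return !globalTaskQueue.empty() || !running; });
        if (!running && globalTaskQueue.empty())
          return;
        task = std::move(globalTaskQueue.front());
        globalTaskQueue.pop();
      }
      task();
      --openTasks;
    }
  }

  std::vector<std::thread> workers;
  std::queue<Task> globalTaskQueue;
  std::mutex queueMutex;
  std::condition_variable taskAvailable;
  std::atomic<int> openTasks{0};
  bool running = true;
};

Usage mirrors the pseudocode above: construct the scheduler with the number of available hardware threads, call AddTaskToScheduler() while building the frame, and call WaitUntilTasksAreFinished() before consuming the results.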
Deficiencies and outlook
Concluding this post, let us look at the deficiencies of our current, simple scheduler:
- Lock operations on the global queue create lots of thread contention.
- Adding tasks to the system can only be done in a serial manner.
- No load-balancing. Even if every task contains the same amount of work to be done, there is no guarantee that threads finish their task at the same time, leading to an unbalanced system where worker threads will be sitting idle.
- No dependencies between tasks.
- When waiting for tasks to be finished, the calling thread sits idle until the worker threads have finished.
Even though it is simple to implement and a good starting point, this basic scheduler needs to be improved a lot before it can provide the advantages discussed at the beginning of this article. Future posts will deal with how to adhere to dependencies, and how to automatically load-balance the work across N cores.
Looking forward to reading the rest of this series. Intel has a couple of articles on tasking that are PC focused rather than cross platform, but have the same basic ideas that you describe. Though they do use spin locks and volatile variables. I was wondering if you had seen them?
Yes, the Intel articles you mentioned are probably about TBB and Nulstein : http://software.intel.com/en-us/articles/do-it-yourself-game-task-scheduling/.
Using spin locks can cause problems on other platforms, and using volatile variables can lead to very subtle race conditions discussed in an older post on my blog: https://molecularmusings.wordpress.com/2012/03/05/volatile-thread-synchronization/
You *can* use volatile variables and spin locks, but only when using the correct memory and/or compiler barriers, which is hard to get right.
I was referring to: http://software.intel.com/en-us/articles/using-tasking-to-scale-game-engine-systems/ and http://software.intel.com/en-us/articles/comparative-performance-of-game-threading-apis-in-task-managers/ Full disclosure: I wrote the second one. I don’t want to hijack the comments here, but I would be interested in getting your thoughts on the articles.
I liked both articles, nicely written.
My approach to parent-child relationships and dependencies is a bit different, and even though one of the articles touches upon the subject of first writing your results to different memory locations, and then summing them in parallel, they didn’t mention False Sharing – which is something I want to talk about, because it can be a total performance killer if you’re not careful.
You’re right. The code related to the articles has a few comments about false sharing. Thanks for taking the time to read those! Looking forward to the rest of this series!
Great articles you have on your blog! You mentioned work stealing at the start of this post. From what I know, you would need a queue for each worker thread. Have you implemented it, or did you choose another approach?
Thanks!
Yes, basically you need a global queue and a local queue for each worker thread. Tasks added from worker threads are added to the local queue, and each thread works on its own queue. If its local queue is empty, the thread is allowed to steal from any other thread’s local queue.
Usually, these local queues are implemented with two ends: one public end (FIFO), and one private end (LIFO). The LIFO end leads to better cache utilization, while the FIFO end provides for good balancing. Additionally, it is possible to implement this two-ended queue in a lock-free fashion, which can be an additional win. All in all, work stealing can greatly reduce contention, but adds quite a bit of complexity to the implementation.
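To make the two ends a bit more concrete, here is a simplified, lock-based C++ sketch of the access pattern (WorkStealingQueue is a made-up name, and a real implementation would typically be lock-free):

#include <deque>
#include <mutex>

template <typename Task>
class WorkStealingQueue
{
public:
  // private end (LIFO): only the owning worker thread calls Push() and Pop()
  void Push(Task task)
  {
    std::lock_guard<std::mutex> lock(mutex);
    tasks.push_back(std::move(task));
  }

  bool Pop(Task& task)
  {
    std::lock_guard<std::mutex> lock(mutex);
    if (tasks.empty())
      return false;
    task = std::move(tasks.back());    // newest task first: better cache utilization
    tasks.pop_back();
    return true;
  }

  // public end (FIFO): other worker threads steal the oldest task
  bool Steal(Task& task)
  {
    std::lock_guard<std::mutex> lock(mutex);
    if (tasks.empty())
      return false;
    task = std::move(tasks.front());   // oldest task first: good balancing
    tasks.pop_front();
    return true;
  }

private:
  std::deque<Task> tasks;
  std::mutex mutex;
};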
I haven’t implemented work stealing in Molecule yet, but have done so in the past for a multi-platform engine I worked on. Right now, I prefer the simplicity and more or less straightforward implementation of the solution described on this blog over a more complex approach. But Molecule might possibly get a work-stealing scheduler in the future :).
Hi Stefan! Working my way through the great articles on your blog and I have some questions for this one in particular.
In your description of the implementation of WaitUntilTasksAreFinished() you mention “waiting until there are no more items in the queue”, but that it’s not advised since it’d require another lock. Why is a lock required for a read? I’m assuming this just means while looping on some variable of the queue until it shows that no tasks remain.
I ask because I’m not sure how the alternative you suggest–using an atomic variable which denotes the number of tasks still available–would be different, since the queue is only ever accessed in a critical section to begin with. I’m assuming this atomic variable would have to be while looped on as well. How would the atomic variable change things?
I know this is an older article of yours but I’d really appreciate the feedback! Thank you!
In this case, it doesn’t really matter whether you use a lock or an atomic variable – both are bad, because you’re busy-waiting on the queue to become empty without doing meaningful work in the meantime. Having an atomic variable that denotes the number of tasks still available allows you to read this variable without taking a lock.
Depending on how your queue is implemented, you might not be able to read the number of elements stored without taking a lock – other threads could push new tasks into it at the same time. E.g. if the queue has to reallocate during a push, your read would fail, but it really depends on the underlying data structure.
That makes a lot of sense. Thanks!