Building a load-balanced task scheduler – Part 1: Basics

With multicore hardware becoming the norm in both PC/console-based gaming as well as on mobile platforms, it is crucial to take advantage of every processor core and thread being thrown at us developers. Therefore, it is important to build technology alleviating the task of writing correct, multi-threaded code.

In order to achieve that, this series will try to explain how to build a load-balancing, work-stealing task scheduler.

In comparison to manually threading your code, automatic task scheduling has certain benefits:

  • Better scalability, easier to take advantage of N cores/threads by automatic load-balancing.
  • Less error-prone, dependencies can be expressed as simple parent-child relationships between tasks.
  • Easier to gain benefits from both function and data parallelism.

In my opinion, the biggest benefit you can get from a task scheduler is increased parallelism of common per-frame tasks (e.g. animation update, collision detection, rendering). On the other hand, perpetual tasks taking several frames to complete like streaming geometry, audio, etc. are better done by hand.

The simplest scheduler

Before we start dealing with more advanced techniques like load-balancing and work-stealing, let us try to build a really simple scheduler which does nothing more than delegating independent tasks to multiple cores.

The basic idea of a task scheduler is simple:

  • Create a worker thread for each available core or hardware thread in the system. Each worker thread helps in executing tasks handed to the scheduler.
  • Put tasks into a queue, with each worker thread pulling tasks to be worked on from this queue.

On fixed hardware (e.g. consoles), you can make sure to create as many threads as needed to not have any over- or under-subscription of threads in the system. Additionally, you can assign fixed hardware threads to each worker thread, which makes performance somewhat more predictable. On the PS3, each available SPU can be considered a “worker thread”.

On other platforms like the PC, in practice there’s not much difference between creating num_cores or num_cores – 1 threads, because the OS’ context switches cannot be predicted or controlled anyway.

In pseudocode, the scheduler implementation could look like the following:

// at startup:
CreateWorkerThreads(numAvailableHardwareThreads);

for each (WorkerThread)
{
  WorkOnTasks();
}

// submit work during a frame:
for each 1000 particles
{
  AddTaskToScheduler(particleData + N*1000, 1000);
}

// ...
// do other work in between, and finally synchronize
WaitUntilTasksAreFinished();

Not concerning ourselves with advanced features at the moment, AddTaskToScheduler() simply adds a task to a global queue, and could be implemented as follows:

AddTaskToScheduler(task)
{
  LockSynchronizationPrimitive();
  globalTaskQueue.Add(task);
  UnlockSynchronizationPrimitive();
}

What’s left is the WorkOnTasks() function which is being run by each of the worker threads:

while (threadShouldRun)
{
  WaitUntilTaskIsAvailable();

  LockSynchronizationPrimitive();
  task = globalTaskQueue.GetAndRemove();
  UnlockSynchronizationPrimitive();

  Execute(task);
}

I specifically left out C++ implementation details for the time being. One thing to note in the current implementation is that tasks are being added and removed to/from a global queue, which of course needs proper synchronization across multiple threads. Disregarding the choice of the underlying synchronization primitive (e.g. mutex, critical section) actually being used, these synchronization spots will cause lots of thread contention. The more tasks and threads we have, the more contention this will cause – this is where work-stealing comes into play, which will be discussed in a future part of the series.

One last thing still missing is the implementation of WaitUntilTasksAreFinished(): The simplest solution in terms of implementation complexity is to simply wait until the are no more items in the queue. This, however, needs another lock to be taken when accessing the global queue, possibly creating even more contention. A better solution is to use an atomic variable which denotes the number of tasks still available.

Implementation details

In case you want to go and implement a simple scheduler yourself, here are a few starting points:

  • Use a simple synchronization primitive like a critical section for protecting the queue from concurrent access. Using a spin-lock or busy-looping on a volatile variable can (and will) cause problems on other platforms. Performance benefits are gained by reducing the causes of contention in the first place, not making individual lock operations marginally faster. Removing sources of contention is being dealt with in future parts of the series.
  • Use a condition variable when waiting for tasks to become available. Worker threads wait on the condition variable to become signaled, while AddTaskToScheduler() signals the condition as soon as the task has been added. This also has the added benefit that worker threads don’t need CPU resources while no task is in the queue.
  • Atomic variables can be implemented using the Interlocked* functions on Windows. Other platforms/OSs should provide similar functions – if not, you can always make use of the hardware instructions using inline assembler code (LOCK prefix on x86).

Deficiencies and outlook

Concluding this post, let us look at the deficiencies of our current, simple scheduler:

  • Lock operations on the global queue create lots of thread contention.
  • Adding tasks to the system can only be done in a serial manner.
  • No load-balancing. Even if every task contains the same amount of work to be done, there is no guarantee that threads finish their task at the same time, leading to an unbalanced system where worker threads will be sitting idle.
  • No dependencies between tasks.
  • When waiting for tasks to be finished, the calling thread sits idle until the worker threads have finished.

Even though simple to implement and a good starting point, this basic scheduler needs to be improved a lot before it can provide advantages discussed in the beginning of this article. Future posts will deal with how to adhere to dependencies, and how to automatically load-balance the work across N cores.

Advertisements

10 thoughts on “Building a load-balanced task scheduler – Part 1: Basics

  1. Looking forward to reading the rest of this series. Intel has a couple of articles on tasking that are PC focused rather than cross platform, but have the same basic ideas that you describe. Though they do use spin locks and volatile variables. I was wondering if you had seen them?

  2. Yes, the Intel articles you mentioned are probably about TBB and Nulstein : http://software.intel.com/en-us/articles/do-it-yourself-game-task-scheduling/.

    Using spin locks can cause problems on other platforms, and using volatile variables can lead to very subtle race conditions discussed in an older post on my blog: https://molecularmusings.wordpress.com/2012/03/05/volatile-thread-synchronization/

    You *can* use volatile variables and spin locks, but only when using the correct memory and/or compiler barriers, which is hard to get right.

  3. Pingback: Building a load-balanced task scheduler – Part 3: Parent-child relationships | Molecular Musings

  4. Pingback: Real-time radiosity | Molecular Musings

  5. Pingback: Building a load-balanced task scheduler – Part 4: False sharing | Molecular Musings

  6. Great articles you have on your blog ! You mentioned work stealing at the start of this post. From what I know you would need a queue for each worker thread. Have you implemented it or you chose another approach ?

    • Thanks!

      Yes, basically you need a global queue and a local queue for each worker thread. Tasks added from worker threads are added to the local queue, and each thread works on its own queue. If its local queue is empty, the thread is allowed to steal from any other thread’s local queue.
      Usually, these local queues are implemented with having two ends: one public end (FIFO), and one private end (LIFO). The LIFO end leads to better cache utilization, while the FIFO end provides for good balancing. Additionally, it is possible to implement this two-ended queue in a lock-free fashion, which can be an additional win. All in all, work stealing can greatly reduce contention, but adds quite a bit of complexity to the implementation.

      I haven’t implemented work stealing in Molecule yet, but have done so in the past for a multi-platform engine I workd on. Right now, I prefer the simplicity and more or less straight-forward implementation of the solution described on this blog over a more complex approach. But Molecule might possibly get a work-stealing scheduler in the future :).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s