Job System 2.0: Lock-Free Work Stealing – Part 2: A specialized allocator

As promised in the last post, today we will look at how to get rid of new and delete when allocating jobs in our job system. Allocations can be handled much more efficiently, as long as we are willing to sacrifice some memory for it. The resulting performance improvement is huge, and certainly worth it.

Why using new and delete is slow

As has been stated before, new basically calls a general purpose allocator behind the scenes. That allocator has to cater to small, medium, and (very) large allocations alike. Additionally, new is inherently thread-safe, using synchronization primitives like a mutex or critical section to protect mutable, shared data from races.

Therefore, it should come as no surprise that we should be able to beat the performance of new and delete in such a special case as ours. Even though the performance and characteristics of general purpose allocators are constantly being improved upon, they will never be able to beat a custom-tailored allocator.

Also remember that allocating instances using new forces us to call delete on those instances in order to avoid memory leaks. This, in turn, constrained us to defer deletion of jobs to the end of a frame, storing them in an auxiliary array. Additionally, we needed yet another atomic counter in the Finish() function.

A pool allocator?

Our job system only does allocations of type Job, so using a pool allocator/free list sounds like it would be a perfect fit. Using a (lock-free) pool allocator indeed improves performance, but still doesn’t solve the problem of having to defer the deletion of job instances.
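As a rough sketch of what such a pool allocator might look like (this is not code from the article; the JobPool class and its member names are mine, and this single-threaded version is shown for clarity — a lock-free variant would pop and push the free-list head with a compare-and-swap):

```cpp
#include <cstddef>

// stand-in for the real 64-byte POD Job struct; 'next' reuses the
// job's storage to link free entries together
struct Job
{
  Job* next;
  char padding[56];
};

class JobPool
{
public:
  JobPool(void) : m_freeList(nullptr)
  {
    // thread every pre-allocated Job onto the free list
    for (size_t i = 0; i < POOL_SIZE; ++i)
    {
      m_jobs[i].next = m_freeList;
      m_freeList = &m_jobs[i];
    }
  }

  Job* Allocate(void)
  {
    // pop the head of the free list; returns nullptr when exhausted
    Job* job = m_freeList;
    if (job)
      m_freeList = job->next;
    return job;
  }

  void Free(Job* job)
  {
    // push the job back onto the free list
    job->next = m_freeList;
    m_freeList = job;
  }

private:
  static const size_t POOL_SIZE = 4096;
  Job* m_freeList;
  Job m_jobs[POOL_SIZE];
};
```

Note that Free() still has to be called for every job eventually, which is exactly the deferred-deletion problem mentioned above.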

A specialized allocator

The crucial thing to realize is that we do the same thing over and over again, every frame. We spawn and allocate N jobs, and delete all N of them at the end of a frame. So why not get rid of allocating and freeing jobs altogether?

This can easily be done by using a global array of pre-allocated Job instances. Because our Job struct is a POD type, we need not worry about calling constructors or destructors. We can allocate an array of e.g. 4096 jobs once when initializing the job system, use this array as a ring buffer from which allocation requests are served, and free the array when shutting down the job system completely, e.g. at application exit.

All we need for this to work is a global array of Jobs, and one atomic counter that serves as a kind of lock-free ring buffer allocator, if you want to call it that. The Allocate() function then simply becomes:

static Job g_jobAllocator[MAX_JOB_COUNT];
static uint32_t g_allocatedJobs = 0u;

Job* AllocateJob(void)
{
  const uint32_t index = atomic::Increment(&g_allocatedJobs);
  return &g_jobAllocator[(index-1) % MAX_JOB_COUNT];
}

As you probably know, the modulus expression (index-1) % MAX_JOB_COUNT can be turned into a binary AND operation when MAX_JOB_COUNT is a power of two, e.g. 4096:

return &g_jobAllocator[(index-1u) & (MAX_JOB_COUNT-1u)];

As can be seen, the atomic counter is a monotonically increasing integer that never gets reset, hence we access the global g_jobAllocator array using a modulus operation, effectively turning it into a ring-buffer.
In production code, you should make sure to put in additional measures that ensure that no more than MAX_JOB_COUNT jobs get allocated in a single frame. Additionally, you could memset() the Job instance (or timestamp it) to make sure that code doesn’t touch entries from past frames.
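Putting these measures together, a sketch of the allocator with such safeguards could look like this (using std::atomic in place of the atomic:: wrapper from this series; the per-frame counter g_jobsAllocatedThisFrame is a hypothetical addition that would be reset at the start of each frame):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// stand-in for the real 64-byte POD Job struct
struct Job
{
  char padding[64];
};

static const uint32_t MAX_JOB_COUNT = 4096u;  // must be a power of two

static Job g_jobAllocator[MAX_JOB_COUNT];
static std::atomic<uint32_t> g_allocatedJobs(0u);

// hypothetical safeguard: reset to 0 at the start of each frame
static uint32_t g_jobsAllocatedThisFrame = 0u;

Job* AllocateJob(void)
{
  // trip an assert if a single frame tries to allocate more jobs
  // than the ring buffer can hold
  assert(++g_jobsAllocatedThisFrame <= MAX_JOB_COUNT);

  // pre-increment returns the new value, like atomic::Increment
  const uint32_t index = ++g_allocatedJobs;
  Job* job = &g_jobAllocator[(index - 1u) & (MAX_JOB_COUNT - 1u)];

  // optional: scrub the entry so stale data from past frames
  // can never be touched accidentally
  std::memset(job, 0, sizeof(Job));
  return job;
}
```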

One important thing to realize with this approach is that we no longer have to delete jobs during a frame. This means that we can get rid of the global auxiliary array, and the corresponding atomic counter, leading to a nicely simplified implementation for the Finish() function:

void Finish(Job* job)
{
  const int32_t unfinishedJobs = atomic::Decrement(&job->unfinishedJobs);

  if ((unfinishedJobs == 0) && (job->parent))
  {
    Finish(job->parent);
  }
}

The amount of pre-allocated memory this approach needs is negligible in the grand scheme of things. With 4096 jobs per frame and a 64-byte Job struct, the array occupies 4096*64 bytes = 256 KB, which is nothing on today’s platforms.

Even though this implementation is a nice step-up from the original approach in terms of performance, we can still do better. Avid readers should know by now what’s coming next.

Going thread-local

So how can we do better than just using one atomic operation per allocation? By using no atomic operation at all, of course! Atomic operations are much cheaper than mutexes and the like, but they aren’t free either.

By moving both the counter and the pre-allocated array of Jobs to thread-local storage, we no longer need any atomic operation for allocating a job from the ring-buffer:

Job* AllocateJob(void)
{
  const uint32_t index = g_allocatedJobs++;
  return &g_jobAllocator[index & (MAX_JOB_COUNT-1u)];
}

In the example code above, g_allocatedJobs and g_jobAllocator are allocated on thread-local storage. Notice the absence of any atomic operation or expensive function call. This is as cheap as it gets.
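One way to express this (a sketch, not the article's code — the article does not show the declarations, and I am using C++11 thread_local here, whereas an engine might use compiler-specific TLS keywords instead) is:

```cpp
#include <cstdint>

// stand-in for the real 64-byte POD Job struct
struct Job
{
  char padding[64];
};

static const uint32_t MAX_JOB_COUNT = 4096u;  // must be a power of two

// one counter and one ring buffer per worker thread;
// nothing is shared, so no atomics are needed
static thread_local uint32_t g_allocatedJobs = 0u;
static thread_local Job g_jobAllocator[MAX_JOB_COUNT];

Job* AllocateJob(void)
{
  const uint32_t index = g_allocatedJobs++;
  return &g_jobAllocator[index & (MAX_JOB_COUNT - 1u)];
}
```

Each worker thread allocates only from its own ring buffer, so two threads can call AllocateJob() concurrently without ever touching the same memory.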

One further thing to note is that this approach needs more memory than the previous one, because every worker thread now needs its own pre-allocated array of MAX_JOB_COUNT Job instances. With 8 worker threads, this bumps the memory requirement from 256 KB to 2 MB, which I still consider a small amount, at least on an 8-core machine.

Performance

Again, performance was measured on an Intel Core i7-2600K CPU clocked at 3.4 GHz, having 4 physical cores with Hyperthreading (= 8 logical cores).

With the improved implementation, the running times are now as follows:

                Old       New       Perf. increase
Single jobs     18.5 ms   9.9 ms    1.86x
parallel_for    5.3 ms    1.35 ms   3.92x

Quite something for a rather small change!

Outlook

Next time, we will be tackling the meat of the job system: the implementation of a lock-free work-stealing queue.


9 thoughts on “Job System 2.0: Lock-Free Work Stealing – Part 2: A specialized allocator”

  1. I like the idea of a “specialized allocator” and clearing all the tasks at the end of a frame, but what about other, long-running tasks, for example resource loading? How do you handle them? With another thread pool?

  2. I’m quite partial to using coroutines (kind of old school and heavy, I know) to implement concurrency. One of the things that I liked when first working with them is that deferred cleanup had an intuitive solution. It was just slotted into the synchronization and context-switch code that already had to exist.

    I don’t know if I’d actually do this in production code though. I know some commercial engines use coroutines but I’m not privy to the actual implementation details.

  3. Pingback: Job System 2.0: Lock-Free Work Stealing – Part 3: Going lock-free | Molecular Musings

  4. Pingback: Job System 2.0: Lock-Free Work Stealing – Part 4: parallel_for | Molecular Musings

  5. Pingback: Job System 2.0: Lock-Free Work Stealing – Part 5: Dependencies | Molecular Musings
