Job System 2.0: Lock-Free Work Stealing – Part 5: Dependencies

In today’s post, we will finally take a look at the last remaining piece of the new job system: adding dependencies between jobs.

Other posts in the series

Part 1: Describes the basics of the new job system, and summarizes work-stealing.
Part 2: Goes into detail about the thread-local allocation mechanism.
Part 3: Discusses the lock-free implementation of the work-stealing queue.
Part 4: Describes high-level algorithms.

Keeping it simple

For the new job system, I really wanted to keep the complexity of the implementation down compared to the previous approach in version 1.0. Sure, the lock-free work-stealing queue is quite complex in its implementation, but that piece of code is self-contained, keeping the rest of the code relatively easy to understand.

Introducing dependencies to the job system in the past always meant that a second queue had to be introduced (for storing jobs that cannot be executed yet), existing functions had to be re-evaluated, and some of the code had to be re-written in order to take dependencies into account.

This time around I wanted to do the simplest thing possible: letting the user state dependencies explicitly by means of continuations.

Instead of saying “whenever this job finishes, try to run any of the other jobs that were maybe waiting on it”, we now explicitly say “as soon as this job finishes, run all its continuations immediately.”

Running jobs immediately in this case means pushing them into the job queue, so that the load-balancing can take over and make sure that all cores/threads are equally getting work done.

In code, this means that we now do the following:

AddContinuation(ancestor, dependency1);
AddContinuation(ancestor, dependency2);

The simplest way to implement this is to store all the continuations directly as data in our Job struct. Which is exactly what we’re going to do!

Supporting continuations

Bumping the Job struct from 64 to 128 bytes in size allows us to store the following:

struct Job
{
  JobFunction function;
  Job* parent;
  int32_t unfinishedJobs;
  char data[52];
  int32_t continuationCount; // new
  Job* continuations[15];    // new
};

In production code, we can steal a few bits here and there because we don’t need to store pointers but can store offsets instead. And even with offsets, we probably don’t need a full 16-bit offset.

This essentially leaves us with support for 52 bytes of data and 15 continuations per job. Or more space for data, but fewer continuations. We could even merge the data and continuations arrays, and store data “dynamically” with a few runtime asserts thrown into the mix.

Adding a continuation to an existing job is straightforward:

void AddContinuation(Job* ancestor, Job* continuation)
{
  const int32_t count = atomic::Increment(&ancestor->continuationCount);
  ancestor->continuations[count - 1] = continuation;
}

Of course we would need to make sure that the ancestor job hasn’t been Run() yet, and that there is still space in the continuations array. Note that incrementing the number of continuations has to be done atomically to ensure that we don’t introduce data races in case other threads try to add other continuations.
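As a sketch, those safety checks could look like the following. Note that the std::atomic types, the MAX_CONTINUATIONS constant, and the cut-down Job struct are illustrative assumptions here, not the article's actual code:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

static const int32_t MAX_CONTINUATIONS = 15;

// cut-down Job struct for illustration; the real one also carries
// the function pointer, the parent, and the data payload
struct Job
{
  std::atomic<int32_t> unfinishedJobs{0};
  std::atomic<int32_t> continuationCount{0};
  Job* continuations[MAX_CONTINUATIONS] = {};
};

void AddContinuation(Job* ancestor, Job* continuation)
{
  // the ancestor must not have finished yet, otherwise this
  // continuation would never be pushed to the queue
  assert(ancestor->unfinishedJobs.load() > 0);

  // fetch_add returns the previous value, which doubles as our slot index
  const int32_t index = ancestor->continuationCount.fetch_add(1);
  assert(index < MAX_CONTINUATIONS);
  ancestor->continuations[index] = continuation;
}
```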

Adapting the job system to support continuations is simple as well; we just have to slightly alter the Finish() function:

void Finish(Job* job)
{
  [...]
  if (job->parent)
  {
    Finish(job->parent);
  }

  // run follow-up jobs
  for (int32_t i=0; i < job->continuationCount; ++i)
  {
    PushToQueue(job->continuations[i]);
  }
  [...]
}

And that’s pretty much it, staying true to the KISS mantra. The new implementation has a few advantages compared to the old approach:

  • Simple implementation.
  • Plays nicely with parent-child relationships, everything works out of the box.
  • No more (possibly) unbounded stack space needed, all memory allocated up-front.

Of course, each Job now needs twice the amount of memory, but in times where even mobile devices offer 512 MB of memory, I don’t see that as a big problem. Furthermore, on architectures that use 128-byte cache lines we would have wanted to bump the size anyway in order to avoid false sharing.
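To make the cache-line guarantee explicit, one could pin the struct size at compile time. The sketch below assumes a 64-bit build and uses 16-bit continuation offsets instead of pointers, following the "store offsets instead of pointers" idea from above; the exact field sizes are assumptions:

```cpp
#include <cstddef>
#include <cstdint>

static const size_t CACHE_LINE_SIZE = 128;

struct alignas(CACHE_LINE_SIZE) Job
{
  void (*function)(Job*);     // 8 bytes on a 64-bit build
  Job* parent;                // 8 bytes
  int32_t unfinishedJobs;     // 4 bytes
  int32_t continuationCount;  // 4 bytes
  uint16_t continuations[15]; // 30 bytes of 16-bit job offsets
  char data[74];              // whatever is left for the payload
};

// if this ever fires, a Job straddles two cache lines and we are
// back to false sharing
static_assert(sizeof(Job) == CACHE_LINE_SIZE,
              "Job must fill exactly one cache line");
```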

Conclusion

Finally, this part concludes the Job System 2.0 series. I hope you liked it!

18 thoughts on “Job System 2.0: Lock-Free Work Stealing – Part 5: Dependencies”

  1. Same basic concept as here:
    http://cbloomrants.blogspot.com/2011/12/12-03-11-worker-thread-system-with.html
    http://cbloomrants.blogspot.com/2012/11/11-08-12-job-system-task-types.html
    The only extra addition from there that I like is a flag per continuation that makes the function execute immediately inline rather than being pushed as a new job.
    This makes it more obviously okay to work around the base system’s limitations by adding extra little helper continuations (namely chaining continuations to support more than 16 dependencies, and supporting dependencies on already-running tasks by having a continuation that safely handles late-binding dependencies).
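One way to realize such a per-continuation flag without growing the Job struct is to carry it in the continuation pointer itself: since Jobs are aligned, the lowest pointer bit is unused. This tagging scheme is my assumption, not part of the original system:

```cpp
#include <cassert>
#include <cstdint>

// stand-in for the real 128-byte Job; only the alignment matters here,
// as it guarantees the lowest pointer bit is always zero
struct alignas(4) Job { int32_t dummy; };

inline Job* TagRunInline(Job* job)
{
  // set the low bit to mark "execute inline when the ancestor finishes"
  return reinterpret_cast<Job*>(reinterpret_cast<uintptr_t>(job) | 1u);
}

inline bool ShouldRunInline(Job* tagged)
{
  return (reinterpret_cast<uintptr_t>(tagged) & 1u) != 0;
}

inline Job* StripTag(Job* tagged)
{
  // clear the low bit to recover the real pointer
  return reinterpret_cast<Job*>(reinterpret_cast<uintptr_t>(tagged) & ~uintptr_t(1));
}
```

In Finish(), the continuation loop would then either execute the job inline or push it to the queue, depending on the flag.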

  2. On a 64-bit system the pointers are 8 bytes long, so there is very little room left inside a job for both padding data and job continuations, even for a 128-byte Job struct. After your fashion, I can only fit half the number of continuations in the same space.

    Also, I noticed that in your benchmarks you create 65000 jobs initially. Did you have to bump `MAX_JOB_COUNT` to 65536 for this to work? In that case, with 8 threads * 128 bytes/job * 65536 jobs/thread = 64 MB, the memory footprint of this job system becomes rather large.

    How would you remedy this?

    • On a 64-bit system the pointers are 8 bytes long, so there is very little room left inside a job for both padding data and job continuations, even for a 128-byte Job struct. After your fashion, I can only fit half the number of continuations in the same space.

      You’re right. The code I presented uses plain pointers because I didn’t want to clutter the main idea with implementation details. That being said, in production code (assuming we use the allocator I posted in Part 2) I would only store indices into the arrays.
      If we assume 32 threads with each of them potentially storing 2048 jobs (which is a lot!), you only need 5 bits for the thread ID, and 11 bits for the job index, so exactly 16 bits. I would only store these 16 bits instead of a full Job* in the struct.
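The 5 + 11 bit split described above could be sketched as follows; 32 threads times 2048 jobs per thread fit exactly into a 16-bit handle. Function names are illustrative:

```cpp
#include <cassert>
#include <cstdint>

static const uint32_t JOB_INDEX_BITS = 11;
static const uint32_t JOB_INDEX_MASK = (1u << JOB_INDEX_BITS) - 1u;

inline uint16_t PackJobHandle(uint32_t threadId, uint32_t jobIndex)
{
  // 5 bits for the thread ID, 11 bits for the job index
  assert(threadId < 32u);
  assert(jobIndex < 2048u);
  return static_cast<uint16_t>((threadId << JOB_INDEX_BITS) | jobIndex);
}

inline uint32_t ThreadIdFromHandle(uint16_t handle)
{
  return handle >> JOB_INDEX_BITS;
}

inline uint32_t JobIndexFromHandle(uint16_t handle)
{
  return handle & JOB_INDEX_MASK;
}
```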

      Also, I noticed that in your benchmarks you create 65000 jobs initially. Did you have to bump `MAX_JOB_COUNT` to 65536 for this to work? In that case, with 8 threads * 128 bytes/job * 65536 jobs/thread = 64 MB, the memory footprint of this job system becomes rather large.

      How would you remedy this?

      I wouldn’t make space for so many jobs because you don’t need that. I did that for testing purposes in order to have something that takes a few milliseconds so I can neglect measurement overhead.
      Even in a large AAA game you probably have around 4000-5000 jobs each frame, which roughly amounts to 4MB in an 8-core system. Considering the fact that we only need that on a machine with 8 physical (or at least logical) cores that we can use anyway, I’d consider that to be a beefy machine where even an extra 32MB shouldn’t be a problem.
      And if that still is a problem, you can always trade performance for memory, and use a multi-threaded linear allocator (as described in Part 2) using atomic operations, which means you only need to worry about the max. number of jobs per frame, but not per thread.
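A minimal sketch of that trade-off: one linear (bump) allocator shared by all threads via an atomic counter, so MAX_JOB_COUNT becomes a per-frame budget instead of a per-thread one. The Job stand-in and the names are illustrative:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Job { char padding[128]; }; // stand-in for the real Job struct

static const uint32_t MAX_JOB_COUNT = 4096;
static Job g_jobPool[MAX_JOB_COUNT];
static std::atomic<uint32_t> g_allocatedJobs{0};

Job* AllocateJob()
{
  // one atomic increment per allocation is the performance we trade
  // for the smaller memory footprint
  const uint32_t index = g_allocatedJobs.fetch_add(1);
  assert(index < MAX_JOB_COUNT);
  return &g_jobPool[index];
}

void ResetJobAllocator()
{
  // call once per frame, after all jobs are guaranteed to have finished
  g_allocatedJobs.store(0);
}
```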

      • I had sort of missed the implementation detail that your parallel_for is likely to distribute the jobs over several threads due to stealing in the setup phase. I guess that with 4000-5000 jobs per frame, it is unlikely that all of these are added to one thread’s job queue. Otherwise, the system wouldn’t be very balanced, and would possibly lead to a “thundering herd” of threads always trying to steal from the main thread.

        I was messing with using a 32-bit integer for storing job offsets, but I see now that your idea of using only 16 bits would work out nicely. That would certainly free up more space for e.g. member function pointers (which take up a lot of space) on a 64-bit system.

        Thanks for the reply Stefan, your blog is pure gold!

  3. I had sort of missed the implementation detail that your parallel_for is likely to distribute the work in the splitting phase, as those jobs are likely to be stolen by the worker threads. That sure takes down the max job number requirement for each thread, and leads to better balancing.

    I was messing with using 32-bit offsets/handles for both job continuations and for the parent, and got that working nicely. I see that your idea of using only 16 bits for offsets would work out and give me more space for e.g. member function pointers (which take up a lot of space on 64-bit systems).

    Thanks for the reply Stefan, your blog is pure gold!

    • That’s a lot of slides! Once it eventually gets to the part about using dependencies for inner parallelism, it gets quite good. I’ve written a similar (toy) system and also needed to add a pretty fancy event system to break dependency chains between systems. It seems like a promising direction to me, but I’ve since started playing with a new language design instead, basically because of all the safety issues. I’d be worried that it would take an unreasonable amount of self-control to write hundreds of random little game components in such a system while avoiding globals, statics and such… maybe with extensive code review…

      • Oh I agree, I don’t see using it in a production environment but I did enjoy the focus on the compile time approach.

  4. Hi, I was wondering if you have any ideas on how to pass static arguments to the jobs? Let’s say that I want to pass DeltaTime to each and every job. I currently have a copy of ParallelFor, JobData etc. which specifically takes an extra float and passes it into the function. There must be a better way to do this. I was looking into using a variadic function to pass any number of parameters into the function, but got stuck on the fact that I need to pass it as a variadic into the ParallelFor, save it in the ParallelForJobData, and then finally pass it as a variadic into the job function.

    I am still working on getting this kind of general implementation to work, but in the meanwhile I was wondering if you had encountered the same “problem” and come up with another solution.

    • Hi,

      I think the two reasonable approaches are:

      • Store your extra arguments somewhere in the job’s data, e.g. use an extra 64 bytes per job which can be used to store whatever you want. If I remember correctly, this was the approach taken by certain SPU libraries on the PS3.
      • Use an empty base class and pass a pointer to each job. In the jobs, cast the pointer to the concrete implementation and access your data as usual. This has the drawback of potential heap allocations, but you can write a separate allocator for that.
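The first approach could be sketched as follows: copy per-frame constants such as DeltaTime into the job's embedded data area at creation time, and read them back inside the job function. The ParallelForArgs name and the cut-down Job are illustrative assumptions:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Job { char data[52]; }; // stand-in; only the data area matters here

struct ParallelForArgs
{
  float deltaTime;
  uint32_t frameIndex;
  // ...any other static arguments, as long as they fit into Job::data
};

void StoreArgs(Job* job, const ParallelForArgs& args)
{
  static_assert(sizeof(ParallelForArgs) <= sizeof(Job::data),
                "args must fit into the job's data area");
  memcpy(job->data, &args, sizeof(args));
}

ParallelForArgs LoadArgs(const Job* job)
{
  // memcpy into a local avoids alignment and strict-aliasing concerns
  ParallelForArgs args;
  memcpy(&args, job->data, sizeof(args));
  return args;
}
```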
  5. Hi,

    Thanks for the articles, I had fun playing with these ideas.

    I’m not sure of the intended use for continuations, though.

    If a job J has a parent P and a continuation C1, and P has a continuation C2, then in your implementation P may finish and launch C2 before C1 has completed, or did I misunderstand something?

    I was playing with a parallel_merge_sort example, and I wonder how you would create the jobs dynamically so that each SPLIT has to do 2 sub-SPLITS and only then one MERGE in order to be considered finished by the parent.

    By reading the article, I thought I could do

    > each SPLIT spawns 2 SPLIT children jobs, and one continuation MERGE job is spawned upon completion. But going recursively it does not work because a SPLIT could complete (thus spawn a MERGE) before the MERGE of its children are finished.

    So the only way I can think of is
    > each SPLIT spawns 2 SPLIT children jobs and wait for both of them, then it MERGE (inline, without spawning a job)

    But then the jobs are staying around waiting, so the job allocator will quickly run out of jobs…

    So the questions are:
    – Did I understand the completion mechanism correctly?
    – How would you organize the jobs for this application?
    – Can you give examples of continuation jobs that can be spawned but not waited for?

    Thanks for your articles !

    • For applications such as parallel_merge_sort, you might have to implement/modify functionality that lets you run continuations as soon as a parent (and all its children) finishes.
      Or did you find another way of solving this?

      • Yes I found a solution that did not involve waiting nor modifying the continuation mechanism.

        SPLIT has SPLIT_right and MERGE as children
        but the MERGE is a continuation of the SPLIT_left job (which also has SPLIT_right as a child).

        This solution ensures that the MERGE continuation cannot occur before both SPLIT_left and SPLIT_right are finished, and it also ensures that SPLIT cannot complete until the MERGE is finished.

        This parallel_merge_sort example was nothing but a bench to test that no deadlocks had been introduced while tweaking the lib (I wanted infinite waiting instead of yielding, and sending the minimum amount of events to wake the waiters). It works great.

        Thanks again for your articles

  6. These articles are great, wondering if you would consider open-sourcing the code in your articles? I believe you will be famous forever

  7. What about braided parallelism? Namely, a job creates more jobs and needs to put them in the queue. I think the problem here is that you need to push these new jobs in front of the queue.
