Job System 2.0: Lock-Free Work Stealing – Part 5: Dependencies

In today’s post, we will finally take a look at the last remaining piece of the new job system: adding dependencies between jobs.

Other posts in the series

Part 1: Describes the basics of the new job system, and summarizes work-stealing.
Part 2: Goes into detail about the thread-local allocation mechanism.
Part 3: Discusses the lock-free implementation of the work-stealing queue.
Part 4: Describes high-level algorithms.

Keeping it simple

For the new job system, I really wanted to keep the complexity of the implementation down compared to the previous approach in version 1.0. Sure, the lock-free work-stealing queue is quite complex in its implementation, but that piece of code is self-contained, keeping the rest of the code relatively easy to understand.

In the past, adding dependencies to the job system always meant that a second queue had to be introduced (for storing jobs that cannot be executed yet), existing functions had to be re-evaluated, and some of the code had to be rewritten in order to take dependencies into account.

This time around I wanted to do the simplest thing possible: letting the user state dependencies explicitly by means of continuations.

Instead of saying “whenever this job finishes, try to run any of the other jobs that were maybe waiting on it”, we now explicitly say “as soon as this job finishes, run all its continuations immediately.”

Running jobs immediately in this case means pushing them into the job queue, so that the load-balancing can take over and make sure that all cores/threads are equally getting work done.

In code, this means that we now do the following:

AddContinuation(ancestor, dependency1);
AddContinuation(ancestor, dependency2);

The simplest solution to implementing this would be to store all the continuations directly as data in our Job struct, which is exactly what we’re going to do!

Supporting continuations

Bumping the Job struct from 64 to 128 bytes in size allows us to store the following:

struct Job
{
  JobFunction function;      // 4 bytes, assuming 32-bit pointers
  Job* parent;               // 4 bytes
  int32_t unfinishedJobs;    // 4 bytes
  char data[52];             // 52 bytes of per-job data
  int32_t continuationCount; // new: 4 bytes
  Job* continuations[15];    // new: 15*4 = 60 bytes, for 128 bytes in total
};

In production code, we can steal a few bits here and there because we don’t need to store pointers but can store offsets instead. And even with offsets, we probably don’t need a full 16-bit offset.

This essentially leaves us with support for 52 bytes of data and 15 continuations per job. Or more space for data, but fewer continuations. We could even merge the data and continuations arrays, and store data “dynamically” with a few runtime asserts thrown into the mix.
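As a rough sketch of that merged-array idea (sizes, names and the 16-bit offsets are illustrative, not the production layout): user data grows from the front of a shared block, continuation offsets grow from the back, and runtime asserts make sure the two regions never overlap.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative only: per-job data and 16-bit continuation offsets share
// one block. Data fills the block from the front, continuations from the
// back; the asserts guard against the two growing into each other.
struct Job
{
  static const int32_t kBlockSize = 84; // whatever padding is left over

  int32_t dataSize = 0;
  int32_t continuationCount = 0;
  char block[kBlockSize];

  void SetData(const void* data, int32_t size)
  {
    // data must not overlap the continuations stored at the back
    assert(size + continuationCount * 2 <= kBlockSize);
    std::memcpy(block, data, size);
    dataSize = size;
  }

  void AddContinuationOffset(uint16_t offset)
  {
    // the new continuation must not overlap the data at the front
    assert(dataSize + (continuationCount + 1) * 2 <= kBlockSize);
    ++continuationCount;
    std::memcpy(block + kBlockSize - continuationCount * 2, &offset, 2);
  }
};
```

The trade-off is made at runtime per job instead of being fixed in the struct layout: a job with lots of data simply has room for fewer continuations.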

Adding a continuation to an existing job is straightforward:

void AddContinuation(Job* ancestor, Job* continuation)
{
  const int32_t count = atomic::Increment(&ancestor->continuationCount);
  ancestor->continuations[count - 1] = continuation;
}

Of course, we would need to make sure that the ancestor job hasn’t been Run() yet, and that there is still space left in the continuations array. Note that incrementing the number of continuations has to be done atomically, so that we don’t introduce data races when other threads try to add continuations at the same time.
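Putting those preconditions into code, a guarded version of AddContinuation might look like the following. This is a self-contained sketch that substitutes std::atomic for the post’s atomic:: wrappers and uses a simplified Job:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Job;
using JobFunction = void (*)(Job*, const void*);

constexpr int32_t kMaxContinuations = 15; // matches the 15-entry array above

struct Job
{
  JobFunction function;
  Job* parent;
  std::atomic<int32_t> unfinishedJobs;
  char data[52];
  std::atomic<int32_t> continuationCount;
  Job* continuations[kMaxContinuations];
};

// Guarded AddContinuation: the asserts document the preconditions from the
// text (job not yet finished, array not full).
void AddContinuation(Job* ancestor, Job* continuation)
{
  // the ancestor must still be "open": its unfinishedJobs counter only
  // reaches 0 once the job has run and finished
  assert(ancestor->unfinishedJobs.load() > 0);

  // fetch_add returns the old value; +1 yields the new count
  const int32_t count = ancestor->continuationCount.fetch_add(1) + 1;
  assert(count <= kMaxContinuations);
  ancestor->continuations[count - 1] = continuation;
}
```

Note that the asserts only catch misuse in debug builds; they don’t make racing against a concurrently finishing job safe.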

Adapting the job system to support continuations is simple as well; we just have to alter the Finish() function slightly:

void Finish(Job* job)
{
  [...]
  if (job->parent)
  {
    Finish(job->parent);
  }

  // run follow-up jobs
  for (int32_t i=0; i < job->continuationCount; ++i)
  {
    PushToQueue(job->continuations[i]);
  }
  [...]
}
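For reference, the elided [...] parts presumably contain the unfinishedJobs bookkeeping from the earlier parts of the series: the parent/continuation logic should only run once the counter drops to zero. A self-contained sketch of how the pieces could fit together, with std::atomic in place of the atomic:: wrappers and a stubbed-out PushToQueue that just records what was pushed:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct Job
{
  Job* parent = nullptr;
  std::atomic<int32_t> unfinishedJobs{0};
  int32_t continuationCount = 0;
  Job* continuations[15] = {};
};

// stand-in for the real work-stealing queue: record which jobs were pushed
std::vector<Job*> g_pushed;
void PushToQueue(Job* job) { g_pushed.push_back(job); }

// Sketch of Finish() with the elided parts filled in, assuming the Part-1
// semantics: act only when the job's unfinishedJobs counter reaches zero.
void Finish(Job* job)
{
  // fetch_sub returns the old value; -1 yields the new count
  const int32_t unfinished = job->unfinishedJobs.fetch_sub(1) - 1;
  if (unfinished == 0)
  {
    if (job->parent)
    {
      Finish(job->parent);
    }

    // run follow-up jobs by pushing them into the queue, so that
    // load balancing takes over
    for (int32_t i = 0; i < job->continuationCount; ++i)
    {
      PushToQueue(job->continuations[i]);
    }
  }
}
```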

And that’s pretty much it, staying true to the KISS mantra. The new implementation has a few advantages compared to the old approach:

  • Simple implementation.
  • Plays nicely with parent-child relationships, everything works out of the box.
  • No more (possibly) unbounded stack space needed, all memory allocated up-front.

Of course, each Job now needs twice the amount of memory, but in times where even mobile devices offer 512MB of memory, I don’t see that as a big problem. Furthermore, on architectures that use 128-byte cache lines, we would have wanted to bump the size anyway in order to avoid false sharing.

Conclusion

Finally, this part concludes the Job System 2.0 series. I hope you liked it!


12 thoughts on “Job System 2.0: Lock-Free Work Stealing – Part 5: Dependencies”

  1. Same basic concept as here:
    http://cbloomrants.blogspot.com/2011/12/12-03-11-worker-thread-system-with.html
    http://cbloomrants.blogspot.com/2012/11/11-08-12-job-system-task-types.html
    The only extra addition from there that I like is a flag per continuation that makes the function execute immediately inline rather than being pushed as a new job.
    This makes it more obviously okay to work around the base system’s limitations by adding extra little helper continuations (namely chaining continuations to support more than 16 dependencies, and supporting dependencies on already running tasks by having a continuation that safely handles late-binding dependencies).
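The per-continuation flag suggested here could live in a spare bit of the continuation itself. A hypothetical sketch, assuming continuations are stored as 16-bit values with the top bit sacrificed for the flag (leaving 15 bits for the job offset); all names are made up:

```cpp
#include <cassert>
#include <cstdint>

// topmost bit marks "execute inline instead of pushing to the queue"
const uint16_t kExecuteInlineFlag = 1u << 15;

uint16_t MakeContinuation(uint16_t jobOffset, bool executeInline)
{
  assert(jobOffset < kExecuteInlineFlag); // offset must fit in 15 bits
  return executeInline ? (jobOffset | kExecuteInlineFlag) : jobOffset;
}

bool ExecutesInline(uint16_t continuation)
{
  return (continuation & kExecuteInlineFlag) != 0;
}

uint16_t OffsetOf(uint16_t continuation)
{
  return continuation & ~kExecuteInlineFlag;
}
```

When Finish() walks the continuations, it would call the flagged ones directly instead of handing them to PushToQueue().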

  2. On a 64-bit system the pointers are 8 bytes long, so there is very little room left inside a job for both padding data and job continuations, even for a 128-byte Job struct. After your fashion, I can only fit half the number of continuations in the same space.

    Also, I noticed that in your benchmarks you create 65000 jobs initially. Did you have to bump `MAX_JOB_COUNT` to 65536 for this to work? In that case, with 8 threads * 65536 jobs/thread * 128 bytes/job = 64MB, the memory footprint of this job system becomes rather large.

    How would you remedy this?

    • On a 64-bit system the pointers are 8 bytes long, so there is very little room left inside a job for both padding data and job continuations, even for a 128-byte Job struct. After your fashion, I can only fit half the number of continuations in the same space.

      You’re right. The code I presented uses plain pointers because I didn’t want to clutter the main idea with implementation details. That being said, in production code (assuming we use the allocator I posted in Part 2) I would only store indices into the arrays.
      If we assume 32 threads with each of them potentially storing 2048 jobs (which is a lot!), you only need 5 bits for the thread ID, and 11 bits for the job index, so exactly 16 bits. I would only store these 16 bits instead of a full Job* in the struct.

      Also, I noticed that in your benchmarks you create 65000 jobs initially. Did you have to bump `MAX_JOB_COUNT` to 65536 for this to work? In that case, with 8 threads * 65536 jobs/thread * 128 bytes/job = 64MB, the memory footprint of this job system becomes rather large.

      How would you remedy this?

      I wouldn’t make space for that many jobs, because you don’t need to. I did that for testing purposes in order to have something that takes a few milliseconds, so I can neglect measurement overhead.
      Even in a large AAA game you probably have around 4000-5000 jobs each frame, which roughly amounts to 4MB in an 8-core system. Considering the fact that we only need that on a machine with 8 physical (or at least logical) cores that we can use anyway, I’d consider that to be a beefy machine where even an extra 32MB shouldn’t be a problem.
      And if that still is a problem, you can always trade performance for memory, and use a multi-threaded linear allocator (as described in Part 2) using atomic operations, which means you only need to worry about the max. number of jobs per frame, but not per thread.
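The 16-bit packing scheme from the reply above (5 bits of thread ID, 11 bits of job index) could be sketched like this; the helper names are made up:

```cpp
#include <cassert>
#include <cstdint>

// A 16-bit stand-in for a Job*: 5 bits select one of up to 32 per-thread
// job pools, 11 bits index one of up to 2048 jobs inside that pool.
using JobOffset = uint16_t;

JobOffset PackJobOffset(uint32_t threadId, uint32_t jobIndex)
{
  assert(threadId < 32u);   // must fit in 5 bits
  assert(jobIndex < 2048u); // must fit in 11 bits
  return static_cast<JobOffset>((threadId << 11) | jobIndex);
}

uint32_t ThreadIdFromOffset(JobOffset offset) { return offset >> 11; }
uint32_t JobIndexFromOffset(JobOffset offset) { return offset & 0x7FFu; }
```

Resolving such an offset back to a Job* would then index into the per-thread allocator arrays from Part 2.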

      • I had sort of missed the implementation detail that your parallel_for is likely to distribute the jobs over several threads due to stealing in the setup phase. I guess that with 4000-5000 jobs per frame, it is unlikely that all of these are added to one thread’s job queue. Otherwise, the system wouldn’t be very balanced, and this could lead to a “thundering herd” of threads always trying to steal from the main thread.

        I was messing with using a 32-bit integer for storing job offsets, but I see now that your idea of using only 16 bits would work out nicely. That would certainly free up more space for e.g. member function pointers (which take up a lot of space) on a 64-bit system.

        Thanks for the reply Stefan, your blog is pure gold!

  3. I had sort of missed the implementation detail that your parallel_for is likely to distribute the work in the splitting phase, as those jobs are likely to be stolen by the worker threads. That sure takes down the max job number requirement for each thread, and leads to better balancing.

    I was messing with using 32-bit offsets/handles for both job continuations and for the parent, and got that working nicely. I see that your idea of using only 16 bits for offsets would work out and give me more space for e.g. member function pointers (which take up a lot of space on 64-bit systems).

    Thanks for the reply Stefan, your blog is pure gold!

    • That’s a lot of slides! Once it eventually gets to the part about using dependencies for inner parallelism, it gets quite good. I’ve written a similar (toy) system and also needed to add a pretty fancy event system to break dependency chains between systems. It seems like a promising direction to me, but I’ve since started playing with a new language design instead, basically because of all the safety issues. I’d be worried that it would take an unreasonable amount of self-control to write hundreds of random little game components in such a system while avoiding globals, statics and such… maybe with extensive code review…

      • Oh I agree, I don’t see using it in a production environment but I did enjoy the focus on the compile time approach.

  4. Hi, I was wondering if you have any ideas on how to pass static arguments to the jobs. Let’s say I want to pass DeltaTime to each and every job. I currently have a copy of ParallelFor, JobData etc. which specifically takes an extra float and passes it into the function. There must be a better way to do this. I was looking into using a variadic function to pass any number of parameters into the function, but got stuck on the fact that I need to pass them as variadics into ParallelFor, save them in the ParallelForJobData, and then finally pass them as variadics into the job function.

    I am still working on getting this kind of general implementation to work, but in the meantime I was wondering if you had encountered the same “problem” and came to another solution.

    • Hi,

      I think the two reasonable approaches are:

      • Store your extra arguments somewhere in the job’s data, e.g. use an extra 64 bytes per job which can be used to store whatever you want. If I remember correctly, this was the approach taken by certain SPU libraries on the PS3.
      • Use an empty base class and pass a pointer to each job. In the jobs, cast the pointer to the concrete implementation and access your data as usual. This has the drawback of potential heap allocations, but you can write a separate allocator for that.
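A minimal sketch of the second approach, the empty base class; all names here are hypothetical:

```cpp
#include <cassert>

// Per-job arguments derive from a common base; the job casts the pointer
// back to the concrete type it expects. Names are made up for illustration.
struct JobArgs
{
  virtual ~JobArgs() {}
};

struct UpdateArgs : public JobArgs
{
  float deltaTime;
};

float g_lastDelta = 0.0f;

void UpdateJob(JobArgs* args)
{
  // the job knows which concrete argument type it was given
  UpdateArgs* update = static_cast<UpdateArgs*>(args);
  g_lastDelta = update->deltaTime;
}
```

The static_cast is only safe because the caller and the job agree on the concrete type; as mentioned above, the main drawback is where the argument objects live, which may require a dedicated allocator.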
