Job System 2.0: Lock-Free Work Stealing – Part 3: Going lock-free

Posted on September 25, 2015 by Stefan Reinalter

This week, we will finally tackle the heart of the job system: the implementation of the lock-free work-stealing queue. Read on for a foray into low-level programming.

Recap

Remember the three operations that the work-stealing queue needs to offer:

Push(): Adds a job to the private (LIFO) end of the queue.
Pop(): Removes a job from the private (LIFO) end of the queue.
Steal(): Steals a job from the public (FIFO) end of the queue.

Further remember that Push() and Pop() are only called by the thread that owns the queue, and thus never concurrently. Steal() can be called by any other thread, at any time, concurrently with both Push() and Pop().

A locking implementation

Before delving into the realm of lock-free programming, let’s first work on an implementation that uses traditional locks. Conceptually, how do we build a data structure that acts like a double-ended queue (deque), with one end behaving like a LIFO, and the other end behaving like a FIFO?

Fortunately, this can quite easily be solved by having two indices that denote the two ends of the deque, as described in “D. Chase and Y. Lev., Dynamic circular work-stealing deque”. If we assume for a moment that we have infinite memory, and therefore an array with an unlimited number of entries, we can introduce two counters called bottom and top that possess the following properties:

bottom indicates the next available slot in the array to where the next job is going to be pushed. Push() first stores the job, and then increments bottom.
Similarly, Pop() decrements bottom, and then returns the job stored at that position in the array.
top indicates the next element that can be stolen (i.e. the topmost element in the deque). Steal() grabs the job from that slot in the array, and increments top.

An important observation to make is that at any given point in time, bottom – top yields the number of jobs currently stored in the deque. If bottom is less than or equal to top, the deque is empty and there is nothing to steal.

Another interesting fact is that both Push() and Pop() only change bottom, and Steal() only changes top. This is a very important property that minimizes synchronization overhead for the owner of the work-stealing queue, and proves to be advantageous for a lock-free implementation.

To better illustrate how the deque works, take a look at the following list of operations performed on the deque:

Operation	bottom	top	size (bottom – top)
Empty deque	0	0	0
Push	1	0	1
Push	2	0	2
Push	3	0	3
Steal	3	1	2
Pop	2	1	1
Pop	1	1	0

As described earlier, Push() and Pop() work in LIFO fashion, while Steal() works in FIFO fashion. In the example above, the first call to Steal() returns the job at index 0, while subsequent calls to Pop() return the jobs at index 2 and 1, in that order.

In C++ code, the implementation of the three operations for the work-stealing queue could look as follows:

void Push(Job* job)
{
	ScopedLock lock(criticalSection);

	m_jobs[m_bottom] = job;
	++m_bottom;
}

Job* Pop(void)
{
	ScopedLock lock(criticalSection);

	const int jobCount = m_bottom - m_top;
	if (jobCount <= 0)
	{
		// no job left in the queue
		return nullptr;
	}

	--m_bottom;
	return m_jobs[m_bottom];
}

Job* Steal(void)
{
	ScopedLock lock(criticalSection);

	const int jobCount = m_bottom - m_top;
	if (jobCount <= 0)
	{
		// no job there to steal
		return nullptr;
	}

	Job* job = m_jobs[m_top];
	++m_top;
	return job;
}

The only thing left to do for now is turning the unbounded, infinite array into a circular array. This could be accomplished by wrapping bottom and top accordingly, e.g. using a modulo operation. But that would make it much harder to calculate the number of jobs currently in the deque, because we would have to account for wrap-arounds. A better and simpler solution is to apply the modulo operation to bottom and top when accessing the array. As long as the number of elements stored in the deque is a power-of-two, this is nothing more than a binary AND operation:

static const unsigned int NUMBER_OF_JOBS = 4096u;
static const unsigned int MASK = NUMBER_OF_JOBS - 1u;

void Push(Job* job)
{
	ScopedLock lock(criticalSection);

	m_jobs[m_bottom & MASK] = job;
	++m_bottom;
}

Job* Pop(void)
{
	ScopedLock lock(criticalSection);

	const int jobCount = m_bottom - m_top;
	if (jobCount <= 0)
	{
		// no job left in the queue
		return nullptr;
	}

	--m_bottom;
	return m_jobs[m_bottom & MASK];
}

Job* Steal(void)
{
	ScopedLock lock(criticalSection);

	const int jobCount = m_bottom - m_top;
	if (jobCount <= 0)
	{
		// no job there to steal
		return nullptr;
	}

	Job* job = m_jobs[m_top & MASK];
	++m_top;
	return job;
}

And there you have it, an implementation of a work-stealing queue using traditional locks.

Prerequisites: Lock-free programming

Lock-free programming is a huge topic, and much has been written about it already. Rather than repeating what has been said, I would like to recommend a few good articles I’d like everybody to read before continuing with this post:

An Introduction to Lock-Free Programming: Touches upon techniques, atomic operations, sequential consistency, and memory ordering. It’s a fantastic “Lock-Free Programming 101” must-read!
Memory Ordering at Compile Time: Talks about compiler reordering and memory barriers, with insightful examples.
Memory Reordering Caught in the Act: Shows an example of memory reordering happening on x86/64, even though Intel’s x86/64 architecture offers a strongly ordered memory model.
Weak vs. Strong Memory Models: Discusses weak and strong memory models, and sequential consistency.

In general, Jeff Preshing’s blog is a gold mine full of articles about lock-free programming. If you’re ever in doubt, check his blog.

A lock-free work-stealing queue

Finished reading all the articles? Good.

Assuming no compiler reordering and absolutely no memory reordering for now, let’s implement all three operations in a lock-free manner, one by one. We will discuss the needed compiler and memory barriers later.

Consider a lock-free implementation of Push() first:

void Push(Job* job)
{
	long b = m_bottom;
	m_jobs[b & MASK] = job;
	m_bottom = b+1;
}

What could happen in terms of other operations happening concurrently? Pop() cannot be executed concurrently, so we only need to consider Steal(). However, Steal() only writes to top, and reads from bottom. So the worst that could happen is that Push() gets pre-empted by a call to Steal() right before signaling the availability of a new item in line #5. This is of no concern, because it only means that Steal() could not steal an item that would have been there already – no harm done.

Next in line, the Steal() operation:

Job* Steal(void)
{
	long t = m_top;
	long b = m_bottom;
	if (t < b)
	{
		// non-empty queue
		Job* job = m_jobs[t & MASK];

		if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
		{
			// a concurrent steal or pop operation removed an element from the deque in the meantime.
			return nullptr;
		}

		return job;
	}
	else
	{
		// empty queue
		return nullptr;
	}
}

As long as top is less than bottom, there are still jobs left in the queue to steal. If the deque is not empty, the function first reads the job stored in the array, and then tries to increment top by using a compare-and-swap operation. If the CAS fails, a concurrent Steal() operation successfully removed a job from the deque in the meantime.

Note that it is important to read the job before carrying out the CAS, because the location in the array could be overwritten by concurrent Push() operations happening after the CAS has completed.

A very crucial thing to note here is that top is always read before bottom, ensuring that the values represent a consistent view of the memory. Still, a subtle race may occur if the deque is emptied by a concurrent Pop() after bottom is read and before the CAS is executed. We need to ensure that no concurrent Pop() and Steal() operations both return the last job remaining in the deque, which is achieved by also trying to modify top in the implementation of Pop() using a CAS operation.

The Pop() operation is the most interesting of the bunch:

Job* Pop(void)
{
	long b = m_bottom - 1;
	m_bottom = b;

	long t = m_top;
	if (t <= b)
	{
		// non-empty queue
		Job* job = m_jobs[b & MASK];
		if (t != b)
		{
			// there's still more than one item left in the queue
			return job;
		}

		// this is the last item in the queue
		if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
		{
			// failed race against steal operation
			job = nullptr;
		}

		m_bottom = t+1;
		return job;
	}
	else
	{
		// deque was already empty
		m_bottom = t;
		return nullptr;
	}
}

In contrast to the implementation of Steal(), this time around we need to ensure that we first decrement bottom before attempting to read top. Otherwise, concurrent Steal() operations could remove several jobs from the deque without Pop() noticing.

Additionally, if the deque was already empty, we need to reset it to a canonical empty state where bottom == top.

As long as there are still several jobs left in the deque, we can simply return it without doing any additional atomic operations. However, as pointed out in the implementation notes above, we need to protect the code against races from concurrent calls to Steal() in case there is only one job left.

If so, the code carries out a CAS to increment top and check whether we won or lost a race against a concurrent Steal() operation. There are only two possibilities to the outcome of this operation:

The CAS succeeds and we won the race against Steal(). In that case, we set bottom = t+1 which sets the deque into a canonical empty state.
The CAS fails and we lost the race against Steal(). In that case, we return an empty job, but still set bottom = t+1. Why? Because loosing the race implies that a concurrent Steal() operation successfully set top = t+1, so we still have to set the deque into an empty state.

Adding compiler and memory barriers

So far, so good. In its current form, the implementation won’t work because we completely left compiler and memory ordering out of the picture. This needs to be fixed.

Consider Push():

void Push(Job* job)
{
	long b = m_bottom;
	m_jobs[b & MASK] = job;
	m_bottom = b+1;
}

Nobody guarantees that the compiler won’t reorder any of the statements above. Specifically, we cannot be sure that we first store the job in the array, and then signal this to other threads by incrementing bottom – it could be the other way around, which would lead to other threads stealing jobs which aren’t there yet!

What we need in this case is a compiler barrier:

void Push(Job* job)
{
	long b = m_bottom;
	m_jobs[b & MASK] = job;

	// ensure the job is written before b+1 is published to other threads.
	// on x86/64, a compiler barrier is enough.
	COMPILER_BARRIER;

	m_bottom = b+1;
}

Note that on x86/64 a compiler barrier is enough, because the strongly ordered memory model does not allow stores to be reordered with other stores. On other platforms (PowerPC, ARM, …) you would need a memory fence instead. Furthermore, notice that the store operation also doesn’t need to be carried out atomically in this case, because the only other operation writing to bottom is Pop(), which cannot be carried out concurrently.

Similarly, we also need a compiler barrier in the implementation of Steal():

Job* Steal(void)
{
	long t = m_top;

	// ensure that top is always read before bottom.
	// loads will not be reordered with other loads on x86, so a compiler barrier is enough.
	COMPILER_BARRIER;

	long b = m_bottom;
	if (t < b)
	{
		// non-empty queue
		Job* job = m_jobs[t & MASK];

		// the interlocked function serves as a compiler barrier, and guarantees that the read happens before the CAS.
		if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
		{
			// a concurrent steal or pop operation removed an element from the deque in the meantime.
			return nullptr;
		}

		return job;
	}
	else
	{
		// empty queue
		return nullptr;
	}
}

Here, we need a compiler barrier to ensure that the read of top truly happens before the read to bottom. Additionally, we would need another barrier that guarantees that the read from the array is carried out before the CAS. In this case however, the interlocked function acts as a compiler barrier implicitly.

On more operation left to go:

Job* Pop(void)
{
	long b = m_bottom - 1;
	m_bottom = b;

	long t = m_top;
	if (t <= b)
	{
		// non-empty queue
		Job* job = m_jobs[b & MASK];
		if (t != b)
		{
			// there's still more than one item left in the queue
			return job;
		}

		// this is the last item in the queue
		if (_InterlockedCompareExchange(&m_top, t+1, t) != t)
		{
			// failed race against steal operation
			job = nullptr;
		}

		m_bottom = t+1;
		return job;
	}
	else
	{
		// deque was already empty
		m_bottom = t;
		return nullptr;
	}
}

The most important part of this implementation are the first three lines. This is one of the rare cases where we need a true memory barrier, even on Intel’s x86/64 architecture. More specifically, adding just a compiler barrier between the store to bottom = b and the read from long t = top is not enough, because the memory model explicitly allows that “Loads may be reordered with older stores to different locations”, which is exactly the case here.

This means that the first few lines of code need to be fixed as follows:

long b = m_bottom - 1;
m_bottom = b;

MEMORY_BARRIER;

long t = m_top;

Note that neither SFENCE nor LFENCE would suffice in this case, and it really needs to be a full MFENCE barrier. Alternatively, we can also use an interlocked operation such as XCHG instead of a full barrier. XCHG acts like a barrier internally, but turns out to be a tiny bit cheaper in this case:

long b = m_bottom - 1;
_InterlockedExchange(&m_bottom, b);

long t = m_top;

The rest of the implementation of Pop() can stay as it is. Similar to Steal(), the CAS operation serves as a compiler barrier, and the store to bottom = t+1 does not need to happen atomically because there can’t be any concurrent operation that also writes to bottom.

Performance

What did the lock-free implementation gain us in terms of performance? Here’s an overview:

	Basic	Thread-local allocator	Lock-free deque	Increase vs. first impl.	Increase vs. second impl.
Single jobs	18.5 ms	9.9 ms	2.93 ms	6.31x	3.38x
parallel_for	5.3 ms	1.35 ms	0.76 ms	6.97x	1.78x

Compared to our first implementation, going lock-free and using thread-local allocators gave us almost 7x the performance. Even on top of using thread-local allocators, the lock-free implementation performs between 1.8x and 3.3x times faster.

Outlook

That’s it for today. Next time, we will see how high-level algorithms like parallel_for can be implemented on top of this system.

31 thoughts on “Job System 2.0: Lock-Free Work Stealing – Part 3: Going lock-free”

Peter on September 26, 2015 at 11:46 am said:

Reading “Work Stealing” and “Lock Free” reminds me of http://actor-framework.org ?
Do you know this framework? Would be interesting how performance compares to it.

Reply ↓
- Stefan Reinalter on September 28, 2015 at 11:29 pm said:
  
  No, haven’t heard of it before. Need to take a proper look at it.
  
  Reply ↓
mehdi on September 27, 2015 at 1:09 pm said:

First, i would like to thank you for the great amazing awsome tutorials,
i have a question, you are only incrementing top & bottom but at some point they will reach their limit and wrap around, so your condition will fail (please correct me if im wrong), what i sujest is to use top != bottom instead of bottom < top, or keep track of the size, but this means other atomic add & sub, or reset them at the end of the frame.

Reply ↓
- Stefan Reinalter on September 28, 2015 at 11:37 pm said:
  
  i have a question, you are only incrementing top & bottom but at some point they will reach their limit and wrap around, so your condition will fail (please correct me if im wrong)
  
  That is absolutely true.
  One solution is to reset both bottom and top at the end of a frame just like you said, the other solution is to use 64-bit integers. With 64-bit integers, you can push and steal 10.000 jobs per frame, running at 60Hz for almost a million years (if my math isn’t off).
  
  what i sujest is to use top != bottom instead of bottom < top
  
  No, that would be wrong. You need to be certain that there are N jobs left in the queue before attempting a Pop() or Steal() operation.
  
  Reply ↓
  - mehdi on September 30, 2015 at 8:04 pm said:
    
    First, thanks for your reply,
    _InterlockedCompareExchange takes a signed 32 bit integer and not a 64 bit, more over in a 32 bit environment 64 bit values are not guaranteed to have atomic writes, however, Microsoft introduces _InterlockedCompareExchange64 variant that takes a 64 bit signed integer, im just wondering is this a real engine code or just a tutorial pseudo code.
    Also when you steal a job you only take a pointer to the job, it could happen that before you execute it, it gets replaced by the queue owner, and then the same job will be executed twice, shouldnt you copy it by value and not the address ?
  - Stefan Reinalter on October 1, 2015 at 8:02 am said:
    
    _InterlockedCompareExchange takes a signed 32 bit integer and not a 64 bit, more over in a 32 bit environment 64 bit values are not guaranteed to have atomic writes, however, Microsoft introduces _InterlockedCompareExchange64 variant that takes a 64 bit signed integer
    
    That’s true, and I currently use 32-bit integers in my implementation.
    
    im just wondering is this a real engine code or just a tutorial pseudo code
    
    Trust me that this is real engine code, stripped from heavy comments and assertions.
    
    Also I can’t help but feel slightly offended by this. I’m putting out source code for everybody to use, and even if it was tutorial code (whatever that has to say about the quality), you should be glad that it’s there, for you to use and improve upon. Feel free to point out mistakes and bugs, and I’ll gladly update the post.
    
    Also when you steal a job you only take a pointer to the job, it could happen that before you execute it, it gets replaced by the queue owner, and then the same job will be executed twice, shouldnt you copy it by value and not the address ?
    
    No, that cannot happen.
    The work-stealing queue only stores pointers to jobs, and not the jobs themselves. They are all allocated using a specialized allocator, as discussed in Part 2.
    So each Job* that is returned is completely unique and accessible until the underlying allocation is freed.
  - mehdi on October 1, 2015 at 10:36 am said:
    
    “Also I can’t help but feel slightly offended by this”
    Please don’t get me wrong, i really love your blog and i have been waiting for a long time after the stateless series for you to come back, that series was really awsome and it was the only tut i could find about multithreaded rendering, and iv e read your blogs more than just once, you present a very very detailed and import materials for free, matter of fact, i can’t even judge the quality of your code as i’m only a beginner, and im very very glad that the code is there along with the awsome explenation, i was just wondering about the source because im very interseted in your work and how experts write code that is all, maybe i just didnt transmit it well probably because im not a native speaker and im not quite open on your culture, sorry if that offended by any mean you but i really really didnt mean any thing.
  - Stefan Reinalter on October 1, 2015 at 1:33 pm said:
    
    Don’t worry, it’s all fine!
    
    I’m sorry, I overreacted. I should have known better that misunderstandings like this are caused by language barriers, especially in written form. So, I apologize.
    
    Please feel free to ask more questions about the implementation or my explanations, if they seem unclear.
Ben Hymers on September 28, 2015 at 1:56 pm said:

“Note that it is important to read the job before carrying out the CAS, because the location in the array could be overwritten by concurrent Push() operations happening after the CAS has completed.”

Is that only because bottom can catch up with top and start overwriting old jobs? Isn’t that a problem worth addressing in itself?

Also, when Pop returns a job when there are multiple jobs left (after the comment ‘// there’s still more than one item left in the queue’), couldn’t multiple Steal calls happen before the job is returned, leaving Pop returning a job that has been stolen?

Reply ↓
- Stefan Reinalter on September 28, 2015 at 11:55 pm said:
  
  “Note that it is important to read the job before carrying out the CAS, because the location in the array could be overwritten by concurrent Push() operations happening after the CAS has completed.”
  
  Is that only because bottom can catch up with top and start overwriting old jobs? Isn’t that a problem worth addressing in itself?
  
  What could happen is that after the CAS completes, some other thread does so many Push() operations that the deque is now completely full, and the slot of the job you wanted to return has now been overwritten. Note that a job itself is never overwritten, but only the pointer to it that’s stored in the array. That is the reason why it’s important to fetch the (pointer to the) job from memory before executing the CAS.
  Regarding top catching up to bottom: a deque can only hold N elements at the same time, which is trivially enforced in the allocator described in Part 2. So if every worker thread is able to allocate N job instances per frame, it can never hold more than N jobs in the deque.
  
  Also, when Pop returns a job when there are multiple jobs left (after the comment ‘// there’s still more than one item left in the queue’), couldn’t multiple Steal calls happen before the job is returned, leaving Pop returning a job that has been stolen?
  
  No, that can’t happen. While it can happen that several jobs are stolen right after a call to Pop(), the one that’s going to get returned by Pop() in this case can’t be stolen, because of the first two lines in the implementation. Whenever we enter that branch of the if-statement (// non-empty queue), we already told all other threads that there is now one less job to steal. That is the exact reason why it’s utterly important that bottom = bottom - 1 gets published to all other threads before doing anything else in Pop(). This is the signal that makes sure that whenever there is more than one item left in the queue, this one can never be stolen in the meantime.
  
  If you were to replace the MEMORY_BARRIER with a simple COMPILER_BARRIER in this case, the situation you described could indeed happen, because the code could then be read as “read t first, then decrement bottom” by the other CPUs – which would be wrong for the reasons you stated. Again, I can’t stress enough how important this MEMORY_BARRIER (or interlocked operation) is.
  
  Reply ↓
shade on September 30, 2015 at 5:22 pm said:

First of all, let me thank you for this great blog I unfortunately haven’t discovered earlier.

I have some questions concerning atomic operations. You are using 64-bit Integers for top and bottom. And isn’t it important that reads and writes on these values need to be atomic for the deque to work properly? Because if they are not then Pop() could write to bottom and if Steal() reads at the same time this could lead to an unpredictable result. How is this ensured, especially on x86 platforms? And isn’t memory alignment also an issue in this context?

Reply ↓
- Stefan Reinalter on September 30, 2015 at 6:00 pm said:
  
  And isn’t it important that reads and writes on these values need to be atomic for the deque to work properly?
  
  The operations that need to be atomic are atomic already, or did I overlook something?
  
  Because if they are not then Pop() could write to bottom and if Steal() reads at the same time this could lead to an unpredictable result.
  
  As far as I know, the implementation does not have unpredictable results, because operations are atomic already in places where they need to be. Regarding Pop(), this is ensured by either using a memory barrier, or an interlocked function (= atomic operation).
  
  How is this ensured, especially on x86 platforms? And isn’t memory alignment also an issue in this context?
  
  Generally speaking, all functions from the Interlocked* family on Windows are atomic operations. Alignment restrictions are also described in the documentation of the corresponding interlocked function.
  
  Please check the post and all comments again. Is there any situation in particular where you think data races between concurrent operations like Pop() and Steal() could happen? If so, I would be happy to discuss those!
  
  Reply ↓
- Noyan Gunday on November 7, 2015 at 5:01 am said:
  
  https://msdn.microsoft.com/en-us/library/ee959491.aspx
  
  Data types that are larger than 4 bytes are not automatically aligned on the stack when you use the x86 compiler to compile an application. Because the architecture for the x86 compiler is a 4 byte aligned stack, anything larger than 4 bytes, for example, a 64-bit integer, cannot be automatically aligned to an 8-byte address.
  
  from https://msdn.microsoft.com/en-us/library/windows/desktop/ms683562(v=vs.85).aspx
  
  The variables for this function must be aligned on a 64-bit boundary; otherwise, this function will behave unpredictably on multiprocessor x86 systems and any non-x86 systems. See _aligned_malloc.
  
  Reply ↓
Pingback: Job System 2.0: Lock-Free Work Stealing – Part 4: parallel_for | Molecular Musings
Pingback: Job System 2.0: Lock-Free Work Stealing – Part 5: Dependencies | Molecular Musings
Fabio on April 19, 2016 at 2:06 am said:

I have a doubt for the lock-free implementation of Pop() function. You are storing off t, and than comparing t != b. Although, the Steal() function can be called multiple times between line 6 and 11 up to empty the queue, effectively making the instructions at line 10 (Job* job = m_jobs[b & MASK]) returning a job that could have been already stolen. What am I missing here?

Reply ↓
- Stefan Reinalter on April 19, 2016 at 11:05 am said:
  It’s tricky.
  Note that the first condition for entering the if is (t <= b), which ensures that the queue is not empty. Only then do we read m_jobs[b & MASK] which is OK, because we’re not using that yet. Only in the case (t != b) do we return that job.
  
  If you imagine what could happen with concurrent Steal() operations, there are basically 2 things:
  - The thread that runs Pop() gets pre-empted before the m_bottom = b operation has been published to other threads. This means that other Steal() operations can steal the job in the meantime, which is not a problem as long as the last remaining job hasn’t been stolen. In case the last job was stolen, the InterlockedCompareExchange in line 18 will fail, and we won’t return a job.
  - The thread that runs Pop() gets pre-empted after the m_bottom = b operation has been published to other threads. This means that we’ve already “claimed” this element from the queue, and other Steal() operations won’t ever be able to steal that item.
  What you might be missing: note that the order of operations in Pop() is first read (and changing) bottom, then reading top. The order of operations in Steal() is first reading top, then reading bottom. This is crucial, because it ensures that they “lock each other out” in potentially problematic cases.
  
  Can you come up with any random interleaving of concurrent operations in Pop() and Steal() that would cause a problem?
  Reply ↓
Paul on October 25, 2016 at 10:24 pm said:

Hi Stefan,

thanks for this great series. Like many others, I am also trying to implement the system outlined in the blog by filling in the details you left out. Right now, I am struggling with the WorkStealingQueue::Pop() method. Like you, I am using unsigned variables for the top and bottom indices. During construction, I initialize there to 0. If there are no jobs in the queue yet, and we perform a Pop(), one is subtracted from bottom, causing it to flip to ~0. The subsequent check (t <= b) succeeds and invalid Job data is returned. Initializing to 1 or any higher value helps, but I still encounter situations where top and bottom become invalid. Any advice?

Also: instead of using a compiler or memory barrier, could I use a volatile long? This seems to be recommended by Microsoft in VS2015. See: https://msdn.microsoft.com/en-us/library/65tt87y8.aspx

Finally: I am having a bit of trouble figuring out how to write the GetWorkerThreadQueue().

Thanks for any advice,

Paul

Reply ↓
- Stefan Reinalter on October 26, 2016 at 11:09 am said:
  
  Hi,
  
  Like you, I am using unsigned variables for the top and bottom indices. During construction, I initialize there to 0. If there are no jobs in the queue yet, and we perform a Pop() […]
  
  I don’t use unsigned variables, but signed integers. Additionally, you should make sure to only let threads call Pop() or Steal() when there are some N jobs in the queue, otherwise threads will constantly try to access the queue.
  
  Also: instead of using a compiler or memory barrier, could I use a volatile long? This seems to be recommended by Microsoft in VS2015. See: https://msdn.microsoft.com/en-us/library/65tt87y8.aspx
  
  You could, but I would refrain from doing so.
  volatile is a keyword that was never intended for multi-threaded environments, but rather hardware access (see my post). If you use what Microsoft calls /volatile:iso compilation mode, you will have *very* subtle race conditions on architectures with a weak memory model. If you use /volatile:ms, you will be given stronger guarantees for each and every volatile access (both reads and writes) – those guarantees are not given by other compilers, hence breaking the code. Additionally, it will make the code slower which is noticeable in contended scenarios, effectively destroying what we did by going lock-free in the first place.
  
  If you don’t want to use compiler-specific instructions, C++11 atomics are the way to go. Just make sure to look into acquire & release semantics because you don’t want full sequential consistency on every access.
  
  Finally: I am having a bit of trouble figuring out how to write the GetWorkerThreadQueue().
  
  Either use thread-local storage, or a pointer to each thread’s queue that is handed to each job by the system.
  
  Reply ↓
  - Paul on October 26, 2016 at 12:59 pm said:
    
    Got it, thanks for your reply!
    
    I was able to get a working version, but it only works in Debug mode, which suggests problems with compiler optimizations. Might be caused by my choice of using volatile integers. I’ll try to stick to the info in the blog. I’m sure I can figure it out.
    
    Regarding the top/bottom variables: I noticed that if I initialize them to a value equal to or larger than the number of threads (including the main thread), neither of them will ever flip to ~0. But I think I’ll use signed integers instead.
Tony on March 27, 2017 at 4:48 am said:

I don’t understand ‘m_bottom = t+1;’

If there are 7 items in the queue, say, top is 3 and bottom is 10, why pop one and set bottom to 4?

Reply ↓
- Tony on March 28, 2017 at 12:29 am said:
  
  Sorry. Missed the `if (t != b)` higher up that takes care of that situation.
  
  Reply ↓
Bobby on July 29, 2017 at 2:01 am said:

Hi, if we’re using this deque algorithm in a case where the maximum number of items is not externally enforced, what is the proper way to check in Push() that the ringbuffer is not full so as not to overwrite item (pointers)? Does it matter if top is read first vs bottom, or does the order not matter (either for correctness, and for efficiency)? It seems to me that the order doesn’t matter though might be slightly more efficient to read top after (as it gives chance for a Steal() to move top and free up a slot).

Reply ↓
- Stefan Reinalter on August 1, 2017 at 11:41 am said:
  
  I think this is an architectural problem. If you’re going to call Push() with no space left in the ringbuffer, what are you going to do?
  If you want to assert that there is still space left (and throw an error in case it’s not), the order doesn’t matter if you’re OK with the assert throwing false positives, e.g. reporting that there was no more space left while a Steal() operation created a free slot in the meantime. If you want to assert without false positives, it gets more tricky because you need to protect the assertion against concurrent Steal() operations.
  
  Reply ↓
Brett Knapik-Blaisure on December 18, 2017 at 12:30 am said:

Hi Stefan,

Thank you for your wonderful set of articles on your blog as they are informative and educational. I noticed you posted the following earlier:

“If you don’t want to use compiler-specific instructions, C++11 atomics are the way to go. Just make sure to look into acquire & release semantics because you don’t want full sequential consistency on every access.”

So this sent me down the path of using the C++ 11 atomic library and learning about acquire and release semantics. This is a very deep subject however, the best resource I could find is a 3 hour talk by herb sutter at channel 9: https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

as well as Phreshings talk from CppCon 2014: https://www.youtube.com/watch?time_continue=4&v=X1T3IQ4N-3g

If anyone else decides to go down that path make sure you invest the time to research the topic.

Stefan:

Have you considered updating this post or making another post on how to implement the lock free queue with the C++ 11 atomics?

Reply ↓
- Stefan Reinalter on December 18, 2017 at 1:19 pm said:
  
  Thanks for the kind words, glad you like the blog!
  No, I haven’t considered updating the post with C++11 atomics. Maybe you want to do it? I’ll put up a link to your github page or similar if you want.
  
  Reply ↓
yukevin on May 18, 2019 at 6:56 pm said:

Hi, I have read the Chales-Lev paper. Both in the paper and your post above, It is stated that the reason we have to read top first before bottom is to have “consistent view of memory”. What does it mean by consistent view of memory. Is there any resource that I can read regarding this concept?
I have find the reason why I have to read top first before bottom by finding an example of sequences of execution that is problematic if we read bottom first before top. The way I think about it is if we read bottom first before top when doing steal operation. There is a chance that a pop operation might empty the dequeue in between these two operations. Then after the dequeue is empty we read the top variable which is up to date so CAS will be successful. But if we read the top variable first, there is no way to empty the queue without modifying top, so CAS will fail if the queue is empty.
But I don’t know what all this related to “consistent view of memory”. I feel like this is an important concept that I may not miss.

Reply ↓
- Stefan Reinalter on June 4, 2019 at 2:36 pm said:
  
  What I meant by “consistent view of memory” was exactly what you described.
  Reading the variables in this order ensures that there can’t be an operation that says “the deque is empty” while a concurrent operation says “the deque is not empty”.
  
  Reply ↓
  - Kevin Yudi Utama on June 13, 2019 at 1:28 pm said:
    
    HI, this might be interesting to you, but Anthony Williams find a data race on this deque when there is wrap arround. How do you prevent the wrap arround? Do you wait for all jobs on this frame to finish before pushing the job for the next frame?
    More detail about the data race: https://stackoverflow.com/questions/56574744/googles-workstealingdequeue-use-memory-order-seq-cst-as-full-memory-barrier-is#comment99736478_56578648.
  - Stefan Reinalter on June 17, 2019 at 2:56 pm said:
    
    Yes, I wait until all jobs are finished before pushing jobs for the next frame into the queue.
    Similarly, you have to make sure that you never have more jobs than can fit into the queue.
Pingback: Naive lock free work stealing queue - PhotoLens