Even though a task scheduler can help alleviate the burden of distributing small pieces of work to different threads, it cannot prevent a few issues that are common in multi-threaded programming, especially in multi-processor environments.
Before we start looking into one of those common issues, these are the other articles in the series so far:
- Part 1: Explains the basics behind a task scheduler, work-stealing, load-balancing, etc., and explains a very simple scheduler implementation.
- Part 2: Discusses Molecule’s task model in detail.
- Part 3: Details how parent-child relationships are handled by the scheduler.
Today, we will take a very simple, serial algorithm and try turning it into a multi-threaded implementation using our task scheduler implemented so far.
The example at hand
The example we are going to look at is very simple: given a matrix of size N*N, we want to count the number of odd elements in it. In a serial environment, this can easily be accomplished using the following piece of code:
uint32_t numOdds = 0;
for (uint32_t i=0; i<MATRIX_SIZE; ++i)
{
  for (uint32_t j=0; j<MATRIX_SIZE; ++j)
  {
    if (data[i*MATRIX_SIZE + j] % 2)
      ++numOdds;
  }
}
On my PC, it takes around 2.4ms for a 1024×1024 matrix (measured using rdtsc/QueryPerformanceCounter).
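For reference, a measurement along those lines could be set up as follows. This is just a minimal sketch, and CountOddsSerial is a hypothetical wrapper around the serial loop shown above, not part of the actual code:

#include <windows.h>

// minimal sketch: time a single run of the serial loop using QueryPerformanceCounter
double MeasureSerialMs(void)
{
  LARGE_INTEGER frequency, start, end;
  QueryPerformanceFrequency(&frequency);

  QueryPerformanceCounter(&start);
  CountOddsSerial();                    // hypothetical wrapper around the loop above
  QueryPerformanceCounter(&end);

  // convert ticks to milliseconds
  return (end.QuadPart - start.QuadPart) * 1000.0 / frequency.QuadPart;
}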
Using our scheduler, we can easily turn the serial code into a streaming task by counting the number of odd elements in each row, and then summing those intermediate results to get the total number of odd elements.
This could look like the following:
void SumKernel(const core::scheduler::TaskData& taskData)
{
  int row = taskData.m_streamingData.m_inputStreams[0] - g_memory;
  int dest = taskData.m_streamingData.m_outputStreams[0] - g_results;
  for (int i=0; i<MATRIX_SIZE; ++i)
  {
    if (g_memory[row*MATRIX_SIZE + i] % 2)
      ++g_results[dest];
  }
}

core::scheduler::KernelData kernelData(&myKernelData, sizeof(MyKernelData));
core::scheduler::InputStream inputStream(g_memory, sizeof(int));
core::scheduler::OutputStream outputStream(g_results, sizeof(int));
core::scheduler::TaskId id = core::scheduler::AddStreamingTask(kernelData, inputStream, outputStream, MATRIX_SIZE*MATRIX_SIZE, &SumKernel);

// wait until all tasks are finished
core::scheduler::Wait(id);

// sum the results of each row
int numOdds = 0;
for (int i=0; i<MATRIX_SIZE; ++i)
{
  numOdds += g_results[i];
}
I have changed the underlying streaming task implementation for this example, making sure that each task works on exactly one row of a matrix. The results are simply written to a global integer array, which is defined as follows:
int g_results[1024];
This means that all the ints sit right next to each other in the array.
The multi-threaded performance on a 4-core i7 system sporting 8 logical cores may be a bit surprising: it takes around 2.3ms. That’s a mere 0.1ms faster than the serial algorithm – this clearly can’t be the multi-core revolution everybody’s talking about!
Why does the algorithm scale so badly?
The reason for this kind of crappy performance on multi-processor systems is something known as false sharing. Because processors only ever read whole cache-lines from memory into the cache, writing one integer on e.g. core 1 causes a different core to re-read adjacent integers into its cache, if both integers happen to lie on the same cache-line. If the cache-line size is 128 bytes, this means that each group of 32 ints will trigger this kind of behaviour on multi-processor systems, deteriorating performance and scalability.
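To see the effect in isolation, a stand-alone repro (just a sketch using std::thread, independent of the scheduler example) could look like this – two threads incrementing adjacent ints will typically run noticeably slower than two threads incrementing ints that live on different cache-lines:

#include <cstdio>
#include <thread>

// the two counters are adjacent in memory and therefore share a cache-line,
// so the line ping-pongs between the cores' caches on every write
int g_counters[2];

void Increment(int index)
{
  for (int i=0; i<100000000; ++i)
    ++g_counters[index];
}

int main(void)
{
  std::thread t0(Increment, 0);
  std::thread t1(Increment, 1);
  t0.join();
  t1.join();
  std::printf("%d %d\n", g_counters[0], g_counters[1]);
  return 0;
}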
We can easily test whether we are hit by false sharing by slightly changing our implementation, making sure that each integer lies on its own cache-line:
// assuming that g_results is aligned to 128-byte boundaries
int g_results[1024*32];

// changed loop implementation in the kernel function
for (int i=0; i<MATRIX_SIZE; ++i)
{
  if (g_memory[row*MATRIX_SIZE + i] % 2)
    ++g_results[dest*32];
}

// changed summation over results
int numOdds = 0;
for (int i=0; i<MATRIX_SIZE; ++i)
{
  numOdds += g_results[i*32];
}
Note the *32, which makes sure that each int lies on its own cache-line (32*sizeof(int) = 128).
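An alternative to hard-coding the *32 stride – again just a sketch, not the implementation used for the measurements here – is to let the compiler enforce the padding by aligning each per-row counter to a cache-line boundary:

// each counter is padded to a full 128-byte cache-line by the alignment,
// so no two counters can ever share a cache-line
struct alignas(128) PaddedCounter
{
  int value;
};

PaddedCounter g_results[1024];

// inside the kernel function
++g_results[dest].value;

// summation over the results
int numOdds = 0;
for (int i=0; i<MATRIX_SIZE; ++i)
{
  numOdds += g_results[i].value;
}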
If we measure the performance again, this time around it takes 0.9ms for the algorithm to finish. Things certainly have improved over the serial implementation!
The observant reader will have noticed that going from 2.4ms to 0.9ms is not really what we might have expected from a 4-core system, but in my opinion that’s mainly due to two reasons:
- The i7 does a really good job at hiding performance penalties caused by crappy code with all its built-in hardware prefetching and the L3 cache. Other systems show far worse performance in situations like this.
- The example we used is completely memory-bound, which is not the kind of algorithm making many-core systems shine.
General advice
Because you can never be sure whether you will be hit by false sharing when writing code like this (the environment you’re running in is mostly an unknown on the PC), I would always recommend using local variables for any kind of intermediate results, writing the results back to memory only once. This can easily be achieved with the following changes to the original loop in the kernel function:
// changed loop implementation, using a local variable
int numOdds = 0;
for (int i=0; i<MATRIX_SIZE; ++i)
{
  if (g_memory[row*MATRIX_SIZE + i] % 2)
    ++numOdds;
}
g_results[dest] = numOdds;
Local variables will never be subject to false sharing, or to any other kind of shared memory access, hence this is the preferred way of writing code. It also helps in other situations where you might suffer the penalties caused by a load-hit-store (e.g. when using memory-bound variables such as references/pointers), so you should get used to writing code like this if you aren’t already doing it. Note that class members also alias each other and input variables through the this-pointer, which can introduce load-hit-stores in totally innocent-looking code without you ever noticing – another argument in favor of using local variables!
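To make the class-member point a bit more concrete, consider the following hypothetical sketch (all names are made up for illustration): because m_numOdds is accessed through the this-pointer, the compiler has to assume it might alias the memory behind the data pointer and may re-load and re-store it on every iteration, whereas the local-variable version can keep the counter in a register:

struct OddCounter
{
  // the member counter may alias *data through the this-pointer,
  // potentially causing a load-hit-store on every increment
  void CountOdds(const int* data, int count)
  {
    for (int i=0; i<count; ++i)
    {
      if (data[i] % 2)
        ++m_numOdds;
    }
  }

  // the local variable cannot alias anything, the result is written back once
  void CountOddsLocal(const int* data, int count)
  {
    int numOdds = 0;
    for (int i=0; i<count; ++i)
    {
      if (data[i] % 2)
        ++numOdds;
    }
    m_numOdds += numOdds;
  }

  int m_numOdds;
};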
The example we looked at is the one used by Scott Meyers in his talk “CPU caches and why you care” – I would highly recommend reading his slides, or watching the talk. His measurements also suggest that the hit caused by not treating your caches nicely can be far worse on other, less powerful systems. There’s also Herb Sutter’s DDJ article about eliminating false sharing.
Conclusion
Watch your cache usage, use local variables where possible, and check the code generated by the compiler from time to time to ensure that you’re not running into unexpected issues caused by aliasing (class members!) or false sharing.
Always keep in mind that processors work with cache-lines only, and that some architectures require you to know about the underlying implications. As an example, using LL/SC instructions on PowerPC architectures (e.g. consoles) only works if you are working with variables aligned to the processor’s cache-line size – but that’s an entirely different story altogether…