Memory system – Part 4

In the last few installments of the Memory system series, we were mostly concerned with how ordinary memory allocation using new and delete works internally, and how we can build our own tools to provide additional functionality. One thing we haven’t discussed yet is the memory arenas, which are responsible for actually allocating memory while providing additional services such as bounds checking, memory tracking, etc.

In this post, we’ll look at different memory allocation schemes and use cases, and try to build our memory arena system keeping those in mind.

First of all, we did not build our own memory allocation utilities in order to replace a general purpose allocator with another general purpose allocator. There are several reasons for this:

  • Nowadays, most general purpose allocators in operating systems either use Doug Lea’s malloc (which is still one of the best allocators around even though it was introduced more than 10 years ago), or others like ptmalloc, tcmalloc, Hoard. It is very hard to beat them, so don’t even try.
  • Memory allocation in games can be made faster because we know exactly when and where we allocate, and in which context. Not every allocator needs to be general purpose, and not every allocator needs to be thread-safe – quite the contrary.
  • Even if you enhance the general purpose allocator with some tricks suited for games (e.g. using memory pools for small allocations), it still doesn’t quite cut it in the grand scheme of things – I’ve seen up to 200x speed-ups just by using a single-threaded linear memory allocator instead of the general purpose one (which already employed several performance-improving tricks).
  • Bounds checking, memory tracking and memory marking are not of global concern, hence they shouldn’t be built into the general purpose allocator. We don’t need to track each and every memory allocation if we already know which sub-system is leaking memory.

To get a better picture of what I am talking about, let’s take a look at one possible implementation of a memory allocation routine which is used as a replacement for the general purpose allocator:

void* TheMemoryAllocator::Allocate(size_t bytes, ...)
{
  // enter critical section, mutex, etc.

#if BOUNDS_CHECKING_ENABLED
  bytes = TheBoundsChecker::AdjustSize(bytes);
#endif

  // do general purpose allocation
  void* memory = ...;

#if MEMORY_TRACKING_ENABLED
  // grabs the callstack internally, rather slow
  TheMemoryTracker::AddAllocation(memory);
#endif

#if MEMORY_MARKING_ENABLED
  TheMemoryMarker::Mark(memory);
#endif

  // leave critical section, mutex, etc.

  return memory;
}

The above is definitely an improvement over the built-in allocator, because it adds debugging features like bounds checking and memory tracking without having to rely on external applications, and you can certainly ship games with it. But still, it has a few quirks which just rub me the wrong way:

  • Bounds checking and memory tracking are now global. You can turn them on and off using preprocessor macros such as MEMORY_TRACKING_ENABLED, but only for the whole application. This leads to situations where you enable memory tracking, rebuild your code, start the application – only to find out that the game now needs 10 minutes to start (grabbing a full callstack is rather slow) and runs out of extra dev-kit memory because of the overhead caused by the memory tracker. Hardly a useful feature in that form.
  • There’s only one kind of memory tracking available – if we want to add additional ones, the code becomes a mess of preprocessor #if/#endif clauses rather quickly. We certainly don’t want to grab a full callstack for each and every allocation just for tracking purposes.
  • Thread-safety is global now. Many allocations don’t need to be thread-safe, and there are different methods of implementing thread-safety (a linear allocator can easily be lock-free or wait-free), without having to rely on heavy-duty synchronization primitives like mutexes or critical sections.

Having said what we don’t want, let’s identify some real-world use cases for memory allocations happening during the lifetime of a game:

  • Linear allocations, e.g. all allocations having application lifetime. They are allocated once in the beginning, and are freed upon exiting the game. They are fixed-size, can be allocated in a linear fashion, and normally there are no threads involved, hence they can be single-threaded. Still, memory tracking and bounds checking need to be optional, because some of those allocations might become level allocations later, or vice versa.
  • Linear allocations during a frame, e.g. all allocations done by the render queue. The render queue is built once each frame, is fixed-size, needs the memory to be linearly allocated for increased cache utilization, and normally involves several threads for multi-threaded rendering. Again, bounds checking and memory tracking are optional.
  • Stack-based allocations, e.g. allocations having level lifetime. Level loading can usually be done in a stack-based fashion, which means that allocations are freed in reverse order of allocation. Additionally, different levels need different amounts of memory, hence the amount of needed memory is variable.
  • Object pools, e.g. for particles. These are fixed-size, hold a certain number of instances, can be single-threaded or not, and provide O(1) allocation and deallocation.
  • Chunk-based allocations, e.g. for streaming level data cut into equally-sized chunks. The maximum amount of memory is fixed, allocations often need to be thread-safe, and the allocator can additionally offer on-the-fly defragmentation.
  • One-frame allocations, used for all temporary allocations during a frame. Often, you need to allocate memory just for one frame in order to get some work done, immediately freeing the allocated memory afterwards. This is comparable to the linear allocations coming from the render queue mentioned above, but often you want to have different allocators in order to be able to provide better cache utilization. Some of them need not be thread-safe, others do.
  • Two-frame allocations, e.g. for temporary results used in the next frame, such as ray-casts done by an SPU, which are used in the next frame (if a latency of one frame is tolerable).
  • General-purpose allocations, e.g. coming from 3rd party libraries/middleware, or temporary variable-sized allocations coming from the file-system, needed for decompressing a file into an auxiliary buffer.
  • Short-term temporary allocations, e.g. used for auxiliary allocations needed while performing an algorithm or similar. Those can be allocated from the stack (!), benefitting from increased data locality, not fragmenting any of the heap allocators.

The list above is not an exhaustive one, and still we’ve already identified lots of different allocation strategies involved, along with all the optional bells and whistles like bounds checking, memory tracking, memory marking and thread-safety. For another bunch of examples, read the excellent Game Engine Architecture by Jason Gregory of Naughty Dog.
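To make the first of these use cases concrete, here is a minimal sketch of a single-threaded linear allocator. The class and member names are made up for illustration, and alignment handling is kept to the bare minimum:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal single-threaded linear allocator sketch: hands out memory by
// bumping a pointer. Individual allocations can never be freed - only the
// whole arena can be reset at once.
class LinearAllocator
{
public:
  LinearAllocator(void* start, size_t size)
    : m_start(static_cast<char*>(start))
    , m_end(m_start + size)
    , m_current(m_start)
  {
  }

  void* Allocate(size_t size, size_t alignment)
  {
    // align the current pointer upwards to the requested power-of-two alignment
    const uintptr_t mask = static_cast<uintptr_t>(alignment) - 1;
    const uintptr_t aligned = (reinterpret_cast<uintptr_t>(m_current) + mask) & ~mask;

    char* userPtr = reinterpret_cast<char*>(aligned);
    if (userPtr + size > m_end)
      return nullptr;  // out of memory

    m_current = userPtr + size;
    return userPtr;
  }

  // frees all allocations at once by rewinding the bump pointer
  void Reset(void)
  {
    m_current = m_start;
  }

private:
  char* m_start;
  char* m_end;
  char* m_current;
};
```

Note how allocation is nothing more than an add and a compare – this is where the massive speed-ups over a general purpose allocator come from.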

Let’s do a quick summary of requirements before discussing how we can merge those into our memory arena system.

Bounds checking

Bounds checking is a useful tool for finding memory stomps, e.g. by adding a 4-byte magic value at the start and end of each allocation. Upon freeing an allocation, we can check whether the memory still contains the magic values – if not, they must have been overwritten, resulting in corrupted memory and hard-to-track-down bugs.

Bounds checking comes in 3 flavours:

  1. No bounds checking (retail builds).
  2. Simple bounds checking, as described above. Useful for checking if memory stomps happen.
  3. Extended bounds checking, which checks the magic value of all allocations upon each allocation/deallocation. Useful for finding out when memory stomps happen.
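Simple bounds checking (flavour 2) could be sketched as follows. The helper names and magic values here are made up for illustration, not an actual implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical simple bounds checker: writes a 4-byte magic value in front
// of and behind each allocation, and validates both guards upon free.
namespace BoundsChecker
{
  const uint32_t FRONT_MAGIC = 0xCAFEBABE;
  const uint32_t BACK_MAGIC = 0xDEADBEEF;

  // how many bytes a guarded allocation of 'userSize' bytes needs in total
  inline size_t GuardedSize(size_t userSize)
  {
    return userSize + 2 * sizeof(uint32_t);
  }

  // writes both guards into the raw allocation and returns the user pointer
  inline void* WriteGuards(void* raw, size_t userSize)
  {
    char* memory = static_cast<char*>(raw);
    memcpy(memory, &FRONT_MAGIC, sizeof(uint32_t));
    memcpy(memory + sizeof(uint32_t) + userSize, &BACK_MAGIC, sizeof(uint32_t));
    return memory + sizeof(uint32_t);
  }

  // returns true if neither guard was overwritten
  inline bool CheckGuards(void* userPtr, size_t userSize)
  {
    char* memory = static_cast<char*>(userPtr) - sizeof(uint32_t);
    uint32_t front = 0;
    uint32_t back = 0;
    memcpy(&front, memory, sizeof(uint32_t));
    memcpy(&back, memory + sizeof(uint32_t) + userSize, sizeof(uint32_t));
    return (front == FRONT_MAGIC) && (back == BACK_MAGIC);
  }
}
```

Extended bounds checking (flavour 3) would additionally keep a list of all live allocations and run CheckGuards on every one of them during each allocation/deallocation, narrowing down exactly when a stomp happens.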

Memory tracking

Memory tracking is used for finding memory leaks by keeping track of each allocation made during the lifetime of an application. Memory tracking comes in several flavours:

  1. No memory tracking (retail builds).
  2. Simple memory tracking. Can already detect the presence of leaks by counting the number of allocations, e.g. via ++numAllocations and --numAllocations. Does not detect where leaks are coming from (very fast, usable in all builds).
  3. Extended memory tracking. Additionally stores the file name and line number where the allocation was originally made (still fast, needs additional memory).
  4. Full memory tracking. Stores a full callstack for each allocation – useful every now and then for hard-to-track-down leaks (slow, needs even more memory).
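Flavour 2, simple memory tracking, is essentially nothing more than a counter. A minimal sketch (class name made up):

```cpp
#include <cstddef>

// Hypothetical simple memory tracker: only counts open allocations.
// A non-zero count at shutdown indicates that something leaked.
class SimpleMemoryTracker
{
public:
  SimpleMemoryTracker(void)
    : m_numAllocations(0)
  {
  }

  void OnAllocation(void)
  {
    ++m_numAllocations;
  }

  void OnDeallocation(void)
  {
    --m_numAllocations;
  }

  // number of allocations that have not been freed yet
  size_t GetNumAllocations(void) const
  {
    return m_numAllocations;
  }

private:
  size_t m_numAllocations;
};
```

The extended and full flavours build on the same hooks, but store extra per-allocation data (file/line, or a whole callstack) alongside the counter.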

Memory marking

Memory marking helps in finding dangling pointers by marking memory with certain bit patterns/magic values. The Visual Studio Debug Heap implementation already does this, marking newly allocated heap memory with 0xCDCDCDCD and freed memory with 0xDDDDDDDD. The bit patterns are carefully chosen such that accessing stale pointers is likely to trigger a segmentation fault, an unaligned read (on some CPUs), or similar.

Memory marking can be either turned on or off.
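A hypothetical memory marker implementing this can be as simple as two memset calls, using the same fill values as the Visual Studio Debug Heap:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical memory marker: fills memory with recognizable bit patterns,
// similar to what the Visual Studio Debug Heap does.
namespace MemoryMarker
{
  inline void MarkAllocated(void* memory, size_t size)
  {
    memset(memory, 0xCD, size);  // freshly allocated, uninitialized memory
  }

  inline void MarkFreed(void* memory, size_t size)
  {
    memset(memory, 0xDD, size);  // freed ("dead") memory
  }
}
```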

Thread-Safety

Thread-safety of allocation routines comes in several flavours:

  1. Single-threaded.
  2. Multi-threaded.
  3. Lock-free or wait-free.
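As an example of the third flavour, a linear allocator can be made lock-free with nothing more than an atomic fetch-and-add on its offset. A minimal sketch using C++11 atomics (class name made up, alignment omitted for brevity):

```cpp
#include <atomic>
#include <cstddef>

// Sketch of a lock-free linear allocator: the only shared state is the
// current offset, which is advanced with an atomic fetch_add. No mutex or
// critical section is needed.
class LockFreeLinearAllocator
{
public:
  LockFreeLinearAllocator(void* start, size_t size)
    : m_start(static_cast<char*>(start))
    , m_size(size)
    , m_offset(0)
  {
  }

  void* Allocate(size_t size)
  {
    // atomically reserve 'size' bytes; concurrent threads can race here safely
    const size_t oldOffset = m_offset.fetch_add(size);
    if (oldOffset + size > m_size)
      return nullptr;  // out of memory (offset stays overflowed, fine for a sketch)

    return m_start + oldOffset;
  }

private:
  char* m_start;
  size_t m_size;
  std::atomic<size_t> m_offset;
};
```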

Memory location

We want to be able to decide whether allocations are made on the heap or on the stack. Yes, you have to be careful when doing the latter, but it can vastly improve performance and lessen fragmentation of the heap.

Allocation strategies

As already mentioned above, there are several allocation strategies used in a game:

  1. Linear allocations.
  2. Stack-based allocations.
  3. Chunk-based allocations.
  4. Object pool allocations.
  5. Growing object pool allocations.
  6. General purpose allocations.

Summary

Phew! As can be seen, there are lots of requirements the memory system needs to fulfill. So far, we have identified at least 6 different allocation strategies, 2 locations where allocations can come from, 3 kinds of thread-safety, and several purely optional kinds of bounds checking, memory tracking and memory marking. Still, we want to be able to arbitrarily combine all of the above into simple, easy-to-use memory arenas.

Next time, we will see how to do exactly that. In the meantime, try to think of possible solutions!


10 thoughts on “Memory system – Part 4”

  1. Pingback: Memory system – Part 5 | Molecular Musings

  2. Pingback: Memory allocation strategies: a linear allocator | Molecular Musings

  3. Hey Stefan,

    I have more of an overall design question – with all the different types of allocators how do you handle passing them around?

    Say we have a level resource “Texture” which is allocated using either a linear or stack allocator (that is perhaps owned by the “Level Loader”) but to actually load the texture we might want to load the texture data into a temporary buffer which only has the lifetime of the texture loading routine. This would potentially mean two allocators must be passed down into the “Texture Loader”.

    That might be a poor example but maybe you can shed some light on practical ways to use this memory system in a large project.

    • Hi wolf,

      Basically, classes in Molecule like the TextureLibrary, ShaderLibrary, etc. each expect a memory arena upon initialization. A ResourcePackage is built offline by the asset pipeline, and stores all data that goes into a package. I think this is somewhat similar to packages in the Unreal Engine.

      In your example, why do you want to load texture data just for the lifetime of the texture loading routine? Is this perhaps for a texture like the loading screen? If so, there’s at least 2 alternatives I can think of:

      1) In Molecule, you would simply load two different packages. One for the loading screen, one for the actual level. The loading screen TextureLibrary would then use a temporary allocator, the other library would use a stack-like allocator. This means you never have to pass down two allocators anywhere, but rather dictate lifetime “from the outside”, e.g. the order & lifetime in which you construct & load packages.

      A simple example:

      ResourcePackage rpLevel("level.meb", levelArena);
      {
        ResourcePackage rpLoadScreen("loadingScreen.meb", loadArena);
        rpLoadScreen.Load();
        rpLevel.Load();
      }

      At the end of the scope, all resources needed by the loading screen would automatically be cleaned up. This is the preferred approach, because resource libraries never need to have intimate knowledge about how resources are loaded and how long they need to live. I had similar examples to this in my blog post about Singletons.

      2) You could build your own allocator that can be switched between stack-like allocations and temporary allocations. When loading the textures, you could simply switch between these 2 modes. I don’t like that approach too much because sooner or later the texture loader itself knows about temporary allocations, which is too much “intrinsic” knowledge for my taste.

      • I think I probably used a bad example. What brought me to asking this question was that I have a Material resource which is made up of JSON – to parse the JSON I previously would load the entire file into a temporary buffer and then pass that onto the thirdparty JSON library I’m using.

        I had read your post about Singletons and that is what brought me to improving my resource system by not having one global “manager”. In my old system the actual resource classes themselves implemented the Load/Save logic.

        The design I currently have is that there are ResourceLoader and ResourceCache classes (I’m contemplating turning ResourceLoader into ResourceIO instead though and have it handle serialize and deserialize of resources – what do you think?).

        Essentially a cache is given a loader, and a loader is given an arena. In the loader it does something like this:

        ResourceLoader::Load(name)
        {
          ResourceHandle handle = NEW(ResourceHandle, arena);
          FileStream fs = fileSystem->Open(name);
          Load(fs, handle);
        }

        That all seems alright to me, but once loading is passed on to the overloaded Load method implemented by a specialised resource loader, it gets problematic:

        class MaterialLoader : public ResourceLoader
        {
          Load(fileStream, handle)
          {
            // I need an arena for temporary allocations
            char* tempBuffer = ???(fileStream->GetSize());
            JSONParser parser(tempBuffer);

            …..
          }
        };

        Perhaps my design isn’t very good and this is maybe a bit off topic but it’s the first major component I’m trying to tackle using your proposed memory system.

        Thanks!

      • Well, that specific “problem” is not really due to using the proposed memory system, but rather using different allocators for different allocations – which you totally should do!
        That being said, don’t think too hard about problems like this. There’s a few things you could do:

        1) Use the proposed solution from my previous answer. That is, you can always build your own arena which allows you to switch between resident and temporary allocations.

        2) Give the MaterialLoader its own arena for temporary allocations. It can simply be a member of the MaterialLoader, initialized with a reasonable amount of temporary memory. The allocator itself could use the temporary memory (e.g. 64KB or so) as a ringbuffer.

        3) Use a global temporary allocator. If I’m not mistaken, that’s how the guys from BitSquid do it. There’s nothing wrong with having a global in situations like this, as long as you don’t wrap all your globals into Singletons because “it’s nicer” and as long as you don’t build your codebase onto the foundation of >30 Singletons.

        In Molecule, the asset pipeline prepares all data files so that every texture, mesh, shader, etc. can be read in one go. The only temporary allocations that are made are done when reading from zip-files, and the arena for that is part of the decompressor.
        Does that help in the grand scheme of things, or are there further questions?

  4. Thanks for the thorough replies!

    With option 2 this seems like a reasonable solution, but wouldn’t it be less memory-efficient? I was under the impression you allocate one large block for the entire application, or perhaps very few, and each allocator dips into this memory area. For this to work, wouldn’t you need to pass around a HeapArea or StackArea to each *Loader so its temporary allocator can use a portion? If not, then wouldn’t allocating a new area for temporary allocations for each Loader be exactly the same as just using the default “new” and allocating on the heap when necessary?

    Option 3 seems nice also but I’ve been trying to keep away from Singletons and globals – in any case, what type of allocator best suits a global temporary allocator? The ones proposed don’t seem to really fit IMO because with a linear or stack allocator you can’t free the allocations without either freeing everything (linear) or by freeing in reverse order (stack) – none of which work well for a global allocator used by various systems.

    You make a good point about the asset pipeline and that is something I’m also looking into – basically have all resources compiled as blobs and therefore no JSON parsing is done in the engine. That’s a whole different kettle of fish though and something I’ll tackle next.

    Thanks again!

    • You wouldn’t need to pass around areas or arenas or anything like that. The loader could have a member, or you could make it a static member if you wish to share the memory between all loaders. That’s up to you. And no, it would not be the same as using default “new”, because that would allocate from anywhere on the heap. If you’ve got different areas (say area A and B), allocations from area A never reside in the address space of area B, and vice versa. That’s a major difference!

      Regarding the allocator, you’ve got several options. You can use something like a double-ended stack allocator, where resident allocations are done from the bottom, and temporary allocations are made from the top. Depending on what kind of temporary allocations you do and how big they are, a reasonable solution would also be to use a ring-buffer allocator. E.g. you could get 4 MB once, and then just use them as a ring buffer. Of course you have to make sure that once the allocator wraps around you no longer need those old allocations – like I said, depends on how big your temporary allocations are.

      Don’t forget that you can always build your own custom allocator doing exactly what you need.
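For illustration, such a ring-buffer allocator could be sketched like this – a minimal, hypothetical version that ignores alignment and relies entirely on the user to guarantee that wrapped-over allocations are no longer in use:

```cpp
#include <cstddef>

// Minimal ring-buffer allocator sketch: hands out short-lived temporary
// allocations from a fixed buffer, wrapping around to the start when a
// request no longer fits at the end. Old allocations are silently
// overwritten once the allocator wraps around.
class RingBufferAllocator
{
public:
  RingBufferAllocator(void* start, size_t size)
    : m_start(static_cast<char*>(start))
    , m_size(size)
    , m_offset(0)
  {
  }

  void* Allocate(size_t size)
  {
    if (size > m_size)
      return nullptr;  // can never satisfy this request

    if (m_offset + size > m_size)
      m_offset = 0;  // wrap around to the start of the buffer

    void* memory = m_start + m_offset;
    m_offset += size;
    return memory;
  }

private:
  char* m_start;
  size_t m_size;
  size_t m_offset;
};
```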

      • I ended up improving my content pipeline and loading in binary blobs which means I no longer need to load into a temporary buffer. I did however implement a simple ring-buffer allocator as you suggested. Thanks again! Looking forward to future articles.
