Memory allocation strategies: a pool allocator

As promised last time, today we will see how pool allocators can help with allocating/freeing allocations of a certain size, in any order, in O(1) time.

Use cases

Pool allocators are extremely helpful for allocating/freeing objects of a certain size which often have to be created/destroyed dynamically, such as weapon bullets, entity instances, rigid bodies, etc.

Most of those objects are created/destroyed in completely random order, due to their dynamic nature. Therefore, it is desirable to be able to allocate/free memory with as little fragmentation as possible. Pool allocators are extremely well suited for that.

How does it work?

Simply put, a pool allocator allocates a chunk of memory once, and divides that memory into slots/bins/pools which fit exactly M instances of size N. As an example, consider we want to have a maximum of 256 bullets in flight at the same time, each bullet having a size of 32 bytes. Thus, the pool allocator would allocate 256*32 = 8192 bytes once, dividing it into slots which are then used for allocating/freeing objects of size 32.

But how are those allocations made? How can we guarantee O(1) time? How can allocations be made in any order, without fragmentation?

Freelists

The answer to all of the above boils down to the use of what is called a free list. Free lists internally store a linked-list of free slots inside the allocated memory. Storing them inplace is crucial – there is no std::vector, std::list, or similar that keeps track of free slots. It is all stored inside the pre-allocated pool of memory.

The way it is usually done is the following: each slot (32 bytes in our example) in the pool of memory is connected to the next slot simply by storing the pointer to the next slot in the first few bytes of the slot.

Assuming our pool of memory sits at location 0x0 in memory, the layout would be something like the following:

         +---------+---------+---------+---------+
         | 0x20    | 0x40    | 0x60    | nullptr |
         +---------+---------+---------+---------+

         ^         ^         ^         ^
         |         |         |         |
address: 0x0       0x20      0x40      0x60

The blocks denote the slots in memory, the bottom row shows the address in memory. As can be seen, the memory at 0x0 would contain a pointer to 0x20, which would contain a pointer to 0x40, and so on. We have just formed an intrusive linked-list in our memory pool.

There is one thing to note here: as long as slots are free, we can store anything we want inside those 32 bytes. When a slot is in use, we don’t need to store anything, because that slot is occupied anyway and so no longer part of our free list. All we have to do is remove a slot from the free list whenever it is allocated, and add it to the linked-list again whenever it is freed.

Implementation

Let us take a look at the inner workings of a free list:

class Freelist
{
public:
  Freelist(void* start, void* end, size_t elementSize, size_t alignment, size_t offset);

  inline void* Obtain(void);

  inline void Return(void* ptr);

private:
  Freelist* m_next;
};

The only member we need to store is a pointer to the free list, which simply acts as an alias in memory, and stores a pointer to a currently free slot in our memory pool.

Leaving alignment and offset requirements out of the equation for now, initializing a free list is quite simple:

union
{
  void* as_void;
  char* as_char;
  Freelist* as_self;
};

// assume as_self points to the first entry in the free list
m_next = as_self;
as_char += elementSize;

// initialize the free list - make every m_next of each element point to the next element in the list
Freelist* runner = m_next;
for (size_t i=1; i<numElements; ++i)
{
  runner->m_next = as_self;
  runner = as_self;
  as_char += elementSize;
}

runner->m_next = nullptr;

With the intrusive linked-list in place, allocating/freeing memory really becomes an O(1) operation, and is just ordinary linked-list manipulation code:

inline void* Freelist::Obtain(void)
{
  // is there an entry left?
  if (m_next == nullptr)
  {
    // we are out of entries
    return nullptr;
  }

  // obtain one element from the head of the free list
  Freelist* head = m_next;
  m_next = head->m_next;
  return head;
}

inline void Freelist::Return(void* ptr)
{
  // put the returned element at the head of the free list
  Freelist* head = static_cast<Freelist*>(ptr);
  head->m_next = m_next;
  m_next = head;
}

I moved the free list implementation to its own class so it can be used by both the non-growing and growing variant of the pool allocator. Furthermore, it is handy for other things, too.

Alignment requirements

In order to satisfy alignment requirements, we need to offset our slots into the memory pool once, so all slots can satisfy the same offset and alignment requirements. The disadvantage of this approach is that a pool allocator can never satisfy more than one offset requirement, but such cases should be very seldom.

The way it is done in Molecule is that users have to provide their maximum object size and maximum alignment requirement when constructing the pool allocator. The allocator can then satisfy all alignments <= maximumAlignment and object sizes <= maximumSize, and simply asserts in all other cases. This at least allows the user to allocate objects of different sizes (such as e.g. 4, 8, 12, 16) out of the same pool, with different alignments (such as e.g. 4, 8, 16, 32), if desired.

Control is in the user’s hands, so it is up to the user to either use different pools for different allocations (often recommended), or live with some wasted memory inside the memory pool.

Usage

Usage is simple. The following shows a free list which is able to satisfy allocations up to a size of 32 bytes and an alignment of 8 bytes. Note that the free list takes a range of memory in which it initializes itself. This ensures that free lists can be used for allocations on the stack, on the heap, and by different allocators (non-growing and growing).

ME_ALIGNED_SYMBOL(char memory[1024], 8) = {};

core::Freelist freelist(memory, memory+1024, 32, 8, 0);

// allocates a slot of 32 bytes, aligned to an 8-byte boundary
void* object0 = freelist.Obtain();

// allocates another slot of 32 bytes, aligned to an 8-byte boundary
void* object1 = freelist.Obtain();

// obtained slots can be returned in any order
freelist.Return(object1);
freelist.Return(object0);

The pool allocator described in this post simply has a Freelist instance as member, and forwards all Allocate() calls to freelist.Obtain(), and all Free() calls to freelist.Return(). Additionally, it asserts that allocation sizes and alignment requests fit the initial maximum sizes provided by the user.

Outlook

The pool allocator was the last remaining non-growing allocator we haven’t discussed yet. Starting with the next post, we will take a look at how to implement growing allocators by means of virtual memory and allocating physical pages from the OS.

28 thoughts on “Memory allocation strategies: a pool allocator

  1. Thanks for sharing! I’m just wondering where you are storing book keeping info in the described case when the user is allowed to allocate objects with different sizes and different alignments.

    • Short answer: I don’t.

      Long answer: With the given maximum alignment and maximum size, I first determine the minimum slot size, and align the first slot accordingly. E.g., given a maximum size of 24 bytes and an alignment of 32 bytes, each slot would become 32 bytes in size.

      Whenever an allocation is made, I can simply assert(allocSize <= maxSize) and assert(allocAlignment <= maxAlignment). By definition this works because if memory is aligned to N bytes, it is also aligned to all powers-of-two smaller than N. Similarly, if a user wants to allocate e.g. 20 bytes out of the above-mentioned pool, he will actually get a 24-byte slot, but doesn't know about that.

      Of course that creates some wasted memory which cannot be reclaimed by other allocators, but I find it to be very convenient, and the user has complete freedom over when to use which pools. Alternatively, one could be very strict, and only allow allocations and alignments of the same size. This would ensure that no memory is wasted.

      • “By definition this works because if memory is aligned to N bytes, it is also aligned to all powers-of-two smaller than N.”

        Oh, this is what actually slipped my mind. Thank you very much for clarification.

  2. Great post! Thanks for sharing!

    I have actually two questions:
    1) How would I make growable pool? The straight forward solution is obviously returning indices instead of pointers. But I wonder if I can have several free lists that I link together?
    2) Is there an easy way to iterate the free list and skip unused elements? I am thinking about using the free list as an “array with holes”, but if I use a templated implementation I cannot easily put some magic value into the unused elements. Do you have any suggestions for this use case or do you recommend something else?

    • Thanks Dirk!

      Regarding your first question, there’s two possibilities:
      1) Request new chunks of memory (=pages) from the OS whenever the pool is full. These chunks store book-keeping information, and are held in a doubly-linked list so they can be easily added/removed.

      So whenever the pool is full, you allocate a chunk, initialize its header and a freelist in that chunk. Chunks that are full can be placed at the back of the list, so that you only need to check the freelist of *one* chunk to see if a slot is available from the freelist. That still keeps Allocate() and Free() at O(1). Purging is O(n) because you need to walk the list of chunks/pages to see which one can be returned to the OS.

      2) An easier solution is to reserve virtual address space up-front, and only initialize a freelist as you would with a non-growing pool. Now whenever a pool is full, you just request physical pages from the OS for a part of the virtual address space you reserved. This means that all freelists live right next to each other, and can simply be initialized without having to keep a doubly-linked list of pages.

      As an example, you could reserve address space for 64 MB, request physical memory for 32 KB, and initialize the freelist. As soon as the pool is full, request another 32 KB of physical memory from the OS, and initialize a freelist there.
      Note that there is no need to “connect” these two freelists using a linked list or anything like that.

      With this implementation, you have to specify an upper-limit for how large the pool can grow. Some consider this a bad thing, I consider it a good thing because it forces people to think about how much memory they are going to need. And because it’s only virtual address space you reserve (not allocate!) you can always specify some ridiculously high value, if you want to. Especially on 64-bit.
      Because virtual memory (or at least reserving address space, backing it with physical pages later) is available on all console platforms (and even on mobiles like the iPhone), this is the way it’s done in Molecule.

      Regarding your second question, let me think about that :). Can you give me an use-case for what kind of allocations/algorithms you’d like to use that?

      • A simple solution for your second question might be to just add an extra byte (or more likely 4 extra bytes) to each slot which you use for storing magic values, or some kind of extra information. E.g. a pool for allocations of 32 bytes could consist of slots having 36 bytes – the only difference being that you still only hand 32 bytes to the user, and use the remaining 4 bytes for whatever you want.

        When iterating the freelist, you can simply walk through the pool with 36 byte increments, and check the magic value to see whether a slot is free or not. Depending on the pool and allocation size that could be slow because you likely need to pull a lot of cache-lines in :(.

      • Thank you very much for your blog!
        You are certainly not afraid of going low-level, unlike most of my colleagues =) I’m heavily missing this spirit in my everyday job…

        As a potential hobby project, I’m currently thinking about creating a more efficient version of Google Sparse Hash Table (https://github.com/sparsehash/sparsehash). It has to do a lot of allocations and deallocations of sizes 8*k for 1<=k<=32|64, so I'm currently thinking about using a dedicated pool allocator for each k to reduce memory overhead and make allocations faster.

        I feel that your second solution (single large contiguous pool) is inferior compared to the first one (list of small separate pools). First of all, requiring a hard upper bound on size may be acceptable in game development, but it is heavily discouraged in many other areas. Aside from that, it is perhaps not a good idea to maintain a single free list in a huge (e.g. 64 Mb) pool. Because after many random allocations/deallocations your free list could end up going through many OS pages in random order. This is the well-known locality issue of pool allocator, and it may increase pressure on TLB. The first solution is better in this regard, since it tends to allocate consecutive elements from only a few pages.

      • Hi Stepan,

        Thanks for the kind words!

        It has to do a lot of allocations and deallocations of sizes 8*k for 1<=k<=32|64, so I'm currently thinking about using a dedicated pool allocator for each k to reduce memory overhead and make allocations faster.

        Yes, that sounds like a smart idea. I’ve used a setup like this in the past where allocations <= 256 bytes would go into their separate pools of sizes like 32, 64, 96, 128, etc. Very fast for small allocations coming from middleware, strings, stuff like that.

        Aside from that, it is perhaps not a good idea to maintain a single free list in a huge (e.g. 64 Mb) pool. Because after many random allocations/deallocations your free list could end up going through many OS pages in random order. This is the well-known locality issue of pool allocator, and it may increase pressure on TLB.

        But that is true for both solutions, isn’t it? It very much depends on the order of allocations/deallocations. If the order is completely random, you’ll end up hitting different pages in both solutions. The first solution (separate pools) makes it easier to build a growing pool though, as you’ve already stated.

  3. Hi Stefan!

    Thanks for the fast reply. So regarding your first example the simplest chunk I can come up with would look something like this:

    struct Chunk
    {
    Freelist mFreelist;
    Chunk* mNext;
    };

    What else would go into the chunk header/footer? Is there really no need to link the free lists? Say I grab the last element in free list in some chunk. I need to link it to the new freellist, right? Or am I missing something.

    Your second approach sounds very interesting. I am not too much into virtual address spaces. Do you have a link where I can study a bit about these things (in our context here)? Your outlook for your blog gives me the impression you might talk about this in the future. Is this correct? Obviously this would be awesome for me 🙂

    • For storing the chunks in a doubly-linked list, you need both a Chunk* mNext and a Chunk* mPrevious in your Chunk struct. Chunks should be stored in a doubly-linked list so that adding and removing chunks is both an O(1) operation.

      At first I also thought that I somehow need to link the freelists of newly created chunks to previous ones, but if you think about it, you will realize that you actually don’t have to link them.

      Consider the following example:
      struct GrowingPoolAllocator
      {
      Freelist m_freelist;
      // other members not needed for now
      };

      At first, the freelist is initialized in whatever memory region you allocated for it, nothing fancy. Now consider the case where the last slot has been allocated: The Freelist* pointer stored inside the freelist (that is, m_freelist.m_next) will be a nullptr, indicating that there are no more slots left.
      If you now allocate a new chunk, all you have to do is initialize your freelist in the following way:

      m_freelist.~Freelist();
      new (&m_freelist) Freelist(startOfNewChunk, endOfNewChunk, …);

      This first properly destructs the freelist by calling its destructor, and creates a new instance by using placement-new at the address of m_freelist. This initializes a new freelist in the newly allocated chunk.

      Now think about what allocating a new slot will do: it will simply grab a slot out of the freelist, which now has plenty of space again.
      What happens if we free any of the slots that live in the “old” freelist?
      Actually, it’s quite simple: the slot will “automagically” be added to the freelist living in the newly created chunk, because its m_next member will now point to exactly that slot. This means that whenever allocations from the old slots are freed, they will automatically become part of the new freelist.

      Essentially, the slots from all freelists in every chunk allocated will automatically merge and be available without you having to link them explicitly. That’s the beauty of the system – if you go for the solution that stores individual chunks instead of reserving address space, all that’s left to do is link the chunks together in the doubly-linked list.

      Regarding the outlook on future posts: Yes, I will talk about virtual memory, address space, physical pages and that stuff in future posts :). I will also discuss how to implement growing allocators – stack-based, pool, and micro allocators for small-size allocations.

      • I see. That is indeed a good idea. I guess the only gotcha is that free list doesn’t own the memory and you keep track of the allocated chunks to free them later properly. Very nice solution. Thanks for pointing this out :).

  4. I thought about this too and I didn’t like it for the same reasons you point out. Any header or footer will add extra memory and also doesn’t work nice alignment requirements.

    Another idea was to let the user specify some fill value which you can check or some bit field at the end of the free list which marks free/used nodes in the pool. There was something like this in GPG8. I will check this out.

    I am physics programmer and I am looking for a good allocation strategy for rigid body contacts. You allocate and free quite a lot of them each frame and for obvious reasons I don’t want them to create any traffic on the heap. Even though we have a pretty good small object allocator I prefer a specialized solution.

    • Yes, storing bits at the end of a pool sounds like a good idea! It should not create too much wasted memory because you’re only storing individual bits, and for many object sizes you will end up having unused space at the end of the pool anyway. That space can be used for storing bits that indicate free/used slots.

      I assume rigid body contacts are always the same size, storing info about penetration depth, contact normals, and the objects involved in the collision (or something along those lines)? If so, then rigid body contacts sound like the perfect candidate for a pool allocator :).

      • There are two types of contacts essentially. Contacts between convex shapes and contacts between convex shapes and meshes. Both contact types store so called manifolds which describe the contact geometry. The contacts should *not* own the manifolds since this wastes quite a lot of memory because not every contact is actually touching, but only the AABBs in the broad phase overlap. So manifolds should most definitely come from a fast pool allocator. Also if you think about sleeping you might want to purge the manifolds, but not the contacts (but I am not sure if this is still necessary since this affects quality if things wake up and can lead to gameplay bugs like in Halo. Maybe on low memory devices you still want to do this though). I think the actual contacts should come from a pool allocator as well, they are are little bit more persistent than manifolds, but if things move you create a lot of traffic on their memory allocator as well.

        For creating the contact constraints for the manifolds in the solver I need to iterate the free list (hence the second part of my question). Initially I just had an intrusive linked list of manifolds, but for the large worlds that I am seeing for next generation hardware just iterating such a link list is too slow due to the cache misses. Even if you do nothing, but just iterate it takes some significant amount of ms. So I was thinking about “array with holes” to iterate faster.

        I hope this gives you some context. Ideally I would like the manifolds in a contiguous array, but this would involve too much copying – I guess. Maybe there exists some other solution, but I am not aware of any at the moment.

  5. Pingback: Adventures in data-oriented design – Part 3c: External References | Molecular Musings

  6. Hey Stefan,

    I was reading this article as I was preparing to write a pool allocator myself and was wondering if this allocator is used along with the templated style mention in your first memory management series. It seems to be quite different from that implementation. I don’t imagine this Freelist/PoolAllocator can be used as a template parameter in your templated memory manager? Please correct me if I am wrong though. And I would love to hear about the details of how that works if it is.

    Thanks,

    Joe

    • The FreeList isn’t used as an allocator directly, but rather used by the PoolAllocator itself.
      The reason for that is that under certain circumstances, you only need a FreeList and not a full-blown allocator. Furthermore, an allocator in Molecule also needs to be able to return the size of allocations, and store additional book-keeping information about the allocator’s load, avg. and max. memory consumption, etc.

      This is the reason why the FreeList and PoolAllocator are two distinct things in my case.

  7. Pingback: Job System 2.0: Lock-Free Work Stealing – Part 2: A specialized allocator | Molecular Musings

  8. Hi Stefan,

    thank you for this great series. It helps me a lot in better understanding and managing my memory. Keep up the great work!

    I would like to ask you about the GrowingPoolAllocator. If the maximum element size is given by the user to the constructor of the GrowingPoolAllocator, how does the allocator (and its FreeList) know how many additional bytes will be needed by the BoundsCheckingPolicy of the arena?
    The FreeList obviously needs to know the max element size requested by the user PLUS the additional bytes for the BoundsCheckingPolicy used by the arena.
    How did you solve this in a good way?

    Thanks
    Simon

    • Hi Simon,

      Thanks, glad you like the blog!

      If all you have is just the pool allocator itself, you don’t need to care about extra bytes, because there won’t be any bounds checking involved.
      If you have a memory arena that uses such an allocator, the memory arena can make sure to pass the correct arguments to the allocator.

  9. That makes sense in terms of having a clear separation between allocator and arena. I’m curious how you go about doing the latter. Do you do this in the arena constructor? Would that not introduce an additional arena constructor parameter to hold the max object size?

    In your post you said that the user constructs the allocator with the max object size and max alignment. I take it that this is where the allocator would walk through the given memory space and initialize its FreeList.

    Next, the user passes the allocator to the constructor of the arena. So it should look similar to this:
    typedef MemoryArena GrowingPoolArena;
    ME_ALIGNED_SYMBOL(char memory[1024], 8) = {};
    GrowingPoolAllocator alloc(memory, memory+1024, 32 /*maxObjSize*/, 8 /*alignment*/);
    GrowingPoolArena arena(alloc);

    Now if I were to allocate a 32-byte object with that arena, which uses a 4-byte bounds checker, the arena will request 40 bytes from the allocator. This is a problem since our allocator can only accommodate a maximum of 32 bytes per element.
    Since the allocator is constructed outside of the arena, how can the arena let the allocator know of the increased object size before we initialize the memory with the FreeLists? Furthermore, the arena does not even have any knowledge that the allocator uses fixed object sizes.

    I’m hoping you can shed some more light on this. It’s probably something really simply but I just can’t figure it out.

    • It’s simple, actually.
      You can let the user construct the allocator, and put that into any arena. This puts the burden on the user to ensure that the correct sizes (including extra storage for bounds checking) are passed to the allocator.
      Another alternative I added to my code is to offer a different memory arena implementation that takes care of this, and takes an extra argument in its constructor. I call this a MemoryPoolArena – it’s basically the same as any other arena, but knows that the allocator is going to be a pooled one.

      You can enforce the allocator to be a pooled one either at runtime by using assertions, or at compile-time by using traits/SFINAE/enable_if – whatever floats your boat.

  10. Hi.

    I was wondering how guard bytes are handled with pool allocator. Lets say user specifies, that it wants a pool allocator with max object size 32B, with some arbitrary alignment.
    Is this size somehow increased to include also the guard bytes ?

    Thank in advance.

      • One last question. Since allocator and arena completely separate, how does this enflated size trickle down to FreeList. I presume the order of creation is as follows:

        HeapAreaPolicy heapPolicy;
        PoolAllocator allocator(heapPolicy.GetStartAddress(),
        heapPolicy.GetEndAddress(), alignment, element_size);
        MemoryArena arena(&allocator);

        Since freelist is a member of pool allocator, it can’t be initialized in PoolAllocator ctor since it needs the MemoryArena, to get the enflated element size. Does this mean that pool allocator and free list actually initialized in ctor of MemoryArena, or is there a more elegant solution.

        Thanks!

      • I introduced three static functions in the MemoryArena (and in the allocators themselves as well, but for a different reason):


        static size_t GetMemoryRequirement(size_t size);
        static size_t GetMemoryAlignmentRequirement(size_t alignment);
        static size_t GetMemoryOffsetRequirement(void);

        These functions take into account bounds checking and other potential overhead and can be called from anywhere because they’re static (the overhead depends on the template policies only).
        When constructing the pool allocator, you can now simply pass down the adjusted values.

        I also introduced a FixedSizeHeap class template that holds an area, the allocator and an arena, taking care of all of the above internally.
        This made it really easy to use fixed-size heaps for certain things in the engine and game.

Leave a Reply to Stefan Reinalter Cancel reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.