Adventures in data-oriented design – Part 1: Mesh data

Let’s face it, performance on modern processors (be it PCs, consoles or mobiles) is mostly governed by memory access patterns. Still, data-oriented design is considered something new and novel, and only slowly creeps into programmers’ brains, and this really needs to change. Having co-workers fix your code and improving its performance really is no excuse for writing crappy code (from a performance point-of-view) in the first place.

This post is the first in an ongoing series about how certain things in the Molecule Engine are done in a data-oriented fashion, while still making use of OOP concepts. A common misconception about data-oriented design is that it is “C-like”, and “not OOP”, and therefore less maintainable – but that need not be the case. The concrete example we are going to look at today is how to organize your mesh data, but let’s start with the pre-requisites first.

Data-oriented design

Much has been said and written about data-oriented design already, and there’s some nice resources on the web. Mike Acton (Technical Director at Insomniac Games) is a well-known advocate of data-oriented design, and has some interesting slides on the web as well.

I will not go into detail about data-oriented design in general, so let me quickly sum up what for me data-oriented design is all about:

  1. Think about data first, and code second. Class hierarchies aren’t important, but data access patterns are.
  2. Think about how data in your game is accessed, how it is transformed, and what you end up doing with it, e.g. particles, skinned characters, rigid bodies, and tons of other examples.
  3. When there’s one, there’s many. Think in streams of data.
  4. Be aware of the overhead of virtual functions, pointers to functions, and pointers to member functions.

Just to make everybody aware of the importance of point 1 and point 2, consider a very simple example:

char* data = pointerToSomeData;
unsigned int sum = 0;
for (unsigned int i=0; i<1000000; ++i, ++data)
{
  sum += *data;
}

In the above example, we take the sum of a million bytes, nothing more. No fancy tricks here, no hidden C++ costs. On my PC this loop takes about 0.7ms.

Let’s slightly alter the loop, and measure its performance again:

char* data = pointerToSomeData;
unsigned int sum = 0;
for (unsigned int i=0; i<1000000; ++i, data += 16)
{
  sum += *data;
}

The only thing that has changed is that we are summing every 16th element now, that is we have changed ++data to data += 16. Note that we still take the sum of exactly 1 million elements, only different elements this time!

The time this loop took? 5ms. Let me spell that out for you: This loop is more than 7 times slower than the original, eventhough we’re accessing the same amount of elements. Performance is all about memory access, and how to make good use of your processor’s caches.

Update: Shanee Nishry has conducted similar performance experiments on her blog, and tested them on different architectures as well.

Remember point 3, “When there’s one, there’s many. Think in streams of data.”? What does this mean? It’s simple: if you have one texture, mesh, sound sample, rigid body, etc. in your game, you’re going to have many of them. Write your code in a way that it can deal with many of them at once – think in terms of streams of objects. That doesn’t mean that you need to get rid of your Mesh and Texture classes, and turn them into a MeshList and TextureList, respectively. This will be dealt with in our later example.

Last but not least, point 4: “Be aware of the overhead of virtual functions, pointers to functions, and pointers to member functions.” is something that is very close to my heart. It is something often ignored (or forgotten about) by less experienced programmers, and I always make sure to teach my students the underlying costs of virtual function calls and the likes. Still, sooner or later they will end up being used in an inner loop, and will kill your performance.

To state it bluntly, you should never use a virtual function, a pointer-to-function, or a pointer-to-member-function on a per-object basis – not for each primitive group of a mesh, not for each particle in your particle system, not for each texel in a texture; you get the idea. They will likely cause both instruction and data cache misses, and kill your performance – be aware of it!

Dealing with static mesh data in a data-oriented way

Having said that, let’s start with today’s example of how to deal with static mesh data in a data-oriented way. There’s certainly many ways in which you can organize your data, so what we’re interested in is the difference in performance between a completely object-oriented design, and a design more concerned about data access patterns.

The scenario of our example is the following:

  • We want to render 500 static meshes, and perform view-frustum culling on them.
  • Each mesh has a vertex buffer, index buffer, and about 3 sub-meshes on average.
  • Each sub-mesh stores a starting index and the number of indices used in the vertex buffer/index buffer of the mesh it belongs to. Additionally, each sub-mesh has a material and an axis-aligned bounding-box.
  • A material contains a diffuse texture, and a light map.

Certainly nothing extremely complicated, which makes it easier to see the difference between the two approaches.

The by-the-book OOP approach

Let’s start with the class design of the object-oriented approach:

class ICullable
{
public:
  virtual bool Cull(Frustum* frustum) = 0;
};

class SubMesh : public ICullable
{
public:
  virtual bool Cull(Frustum* frustum)
  {
    m_isVisible = frustum->IsVisible(m_boundingBox);
  }

  void Render(void)
  {
    if (m_isVisible)
    {
      m_material->Bind();
      context->Draw(m_startIndex, m_numIndices);
    }
  }

private:
  unsigned int m_startIndex;
  unsigned int m_numIndices;
  Material* m_material;
  AABB m_boundingBox;
  bool m_isVisible;
};

class IRenderable
{
public:
  virtual void Render(void) = 0;
};

class Mesh : public IRenderable
{
public:
  virtual void Render(void)
  {
    context->Bind(m_vertexBuffer);
    context->Bind(m_indexBuffer);

    for (size_t i=0; i<m_subMeshes.size(); ++i)
    {
      m_subMeshes[i]->Render();
    }
  }

private:
  VertexBuffer* m_vertexBuffer;
  IndexBuffer* m_indexBuffer;
  std::vector<SubMesh*> m_subMeshes;
};

Hopefully, some of you will already cry at the sight of such a design. One might think that the above is an exaggeration of OOP design gone wild, but let me assure you that I’ve seen far worse in shipped games. The contender for the worst example was something along the lines of the following:

class GameObject : public IPositionable, public IRenderable, public IScriptable, ...

I cannot remember the exact details, but the actual implementation inherited from 6 or 7 classes, some of them having implementations, others just having pure virtual functions, others having virtual functions for the sole purpose of making dynamic_casts work on them… I don’t want to point fingers, but rather reinstate that such designs can be found in shipped, commercial games and are not made-up examples – but let’s not digress.

Back to the original example, what is so bad about it? Well, there’s several things:

  1. The two interfaces IRenderable and ICullable cause a virtual function call hit on each call to Render() and Cull(), respectively. Furthermore, virtual functions almost never can be inlined, further deteriorating performance.
  2. Bad memory access patterns when culling meshes. In order to read the AABB, a whole cache line (64 bytes on modern architectures) will be read into the processor’s cache, whether you need it or not. This boils down to accessing more data than necessary, which is bad for performance, as could be seen in our very first simple example (summing 1 million bytes).
  3. Bad memory access patterns when rendering sub-meshes. Each sub-mesh checks its own m_isVisible flag, which again will pull more data into the cache than necessary. Same bad behaviour as in the example above.
  4. Bad memory access patterns in general. Points 3 and 4 above assume that all the meshes’ and sub-meshes’ data is linearly laid out in memory. In practice, this is seldom the case, because programmers often just slam a new/delete in there, and be done with it.
  5. Bad memory access patterns for future algorithms. As an example, in order to render geometry into a shadow-map, no materials need to be bound, only the start index and number of indices are needed. But each time we access them, we pull the other data into the cache as well.
  6. Bad threadability. How would you perform the culling process of several sub-meshes on different CPUs? How would you perform the culling on SPUs?

So how can this situation be improved? How can we organize our data so that it offers good performance now and in the future?

The data-oriented design approach

The obvious things to get rid of are the two interfaces, and the virtual function calls associated with them. Believe me, you don’t need those interfaces.

The other thing we can change is to make sure that data that needs to be accessed together actually is allocated next to each other in memory. Furthermore, it would be nice if this was somehow done implicitly, without the programmer having to worry about memory allocations that much.

Going down the DOD route, such a design could look like this:

struct SubMesh
{
  unsigned int startVertex;
  unsigned int numIndices;
};

class Frustum
{
public:
  void Cull(const float* in_aabbs, unsigned bool* out_visible, unsigned int numAABBs)
  {
    // for each AABB in in_aabbs:
    // - determine visibility
    // - store visibility in out_visible
  }
};

class Mesh
{
public:
  void Cull(Frustum* frustum)
  {
    frustum->Cull(&m_boundingBoxes[0], &m_visibility[0], m_boundingBoxes.size());
  }

  void Render(void)
  {
    context->Bind(m_vertexBuffer);
    context->Bind(m_indexBuffer);

    for (size_t i=0; i<m_visibleSubMeshes.size(); ++i)
    {
      if (m_visibility[i])
      {
        m_materials[i]->Bind();
        const SubMesh& sm = m_subMeshes[i];
        context->Draw(sm.startIndex, sm.numIndices);
      }
    }
  }

private:
  VertexBuffer* m_vertexBuffer;
  IndexBuffer* m_indexBuffer;

  std::vector<float> m_boundingBoxes;
  std::vector<Material*> m_materials;
  std::vector<SubMesh> m_subMeshes;
  std::vector<bool> m_visibility;
};

Let’s examine this approach’s characteristics:

  1. Good memory access patterns when culling sub-meshes. In order to cull N elements, exactly N*6 floats are accessed in memory. Visibility is written to a seperate destination in memory, so no overhead there either.
  2. Good memory access patterns when rendering sub-meshes. All the data needed is stored contiguously in memory, and we only load data into the cache which is actually needed.
  3. Good memory access patterns for future use. If we want to render sub-meshes into a shadow-map, we only need to access m_subMeshes, nothing more.
  4. Good memory access patterns in general. Memory need not be allocated manually in order for the elements to be contiguous. A std::vector always is, and so is a simple array. You can go further and put all your Mesh instances into a seperate container as well, if you want.
  5. Easier threadability. If culling (or any other operation, for that matter) needs to be split across multiple threads (or SPUs), we can divide the data into chunks, and feed each chunk of data to different threads. No synchronization, mutexes, or similar needed in those cases, and the solution scales almost perfectly, making it easy to take advantage of 4-core, 8-core or N-core processors, which will be the future of PCs and consoles.

I don’t claim that the presented solution is totally perfect – it certainly isn’t, it just served as an example. Points of improvement from the top of my head:

  1. Instead of storing bools into the out_visible array, we could directly store the sub-meshes’ data, saving us both the branch and further indirections when rendering the visible meshes in Mesh::Render().
  2. Taking it even further, we can directly store rendering commands into the GPU buffer if we want, e.g. by writing into a command-buffer on the PS3.
  3. If your content pipeline supports it, bake all the static data into one single, big Mesh. All the static mesh data for a whole level is then guaranteed to be contiguous in memory, so you can reap the benefits of the above solution automatically.

What performance improvement can we expect to see from such a design change? In the experiments I conducted (500 meshes, 3 sub-meshes each, not counting Direct3D11 API call overhead), the difference between the two presented solutions was about 1.5ms.

If that doesn’t sound like much, consider that 1.5ms actually make up almost 10% of a 60Hz frame. Furthermore, we’re talking about 500 meshes here, and have only touched the design of static mesh data. If data-oriented design is applied throughout the whole engine, the performance benefits can be enormous.

And in case you wondered, of course the data-oriented design won. Let’s end this post by saying that you need to start caring about your data access patterns now, I mean it. So does Scott Meyers.

Advertisements

32 thoughts on “Adventures in data-oriented design – Part 1: Mesh data

  1. This sounds all really nice in theory, but for me there is one essential question unanswered: How to access these meshes from outside/user code and support also adding and removing of new mesh instances? Also if you want to support some kind of streaming you need this entity management. One may solve this problem by implementing some kind of “smart-handles” which store an index to address these “property-rows”.
    In this implementation I also see a little bit of memory wastage, because you store four std:vectors for your properties. Here you can implement some kind of “property vector” template which stores arrays of properties like your bounding box and so on, combined with the basic std::vector functionality. So you can generalize the SoA principal.

    Cheers,
    Michael

    • Let me try to answer your questions one at a time.

      1) The Mesh class shown is for static meshes, which are all compiled/baked in the content pipeline. Nothing dynamic there, the amount of memory needed is fixed, allocated once, and all the data is directly loaded into memory. So there isn’t any functionality for adding/removing meshes in this class.

      For dynamically created meshes (spawning entities in scripts, for example), I use an entirely different data structure, because you need to ensure that memory doesn’t get fragmented during game-play. You can either use a pool allocator for that (using implicit free-lists and other tricks), or just use an array, and swap the last element with the one to be removed, so everything stays contiguous in memory. That”s what I use, because the copy is done only once, and copying a few bytes doesn’t hurt much, but has the nice property of keeping everything next to another in memory. Of course you need to call destructors manually, and make use of placement new, but that can be handled easily.

      2) Streaming is handled in an entirely different way. Streaming tends to fragment memory, hence its best to set aside some amount of memory which is used for streamed resources, and defragment that memory on-the-fly. That can be done by either implementing a smart defragmentation scheme, or by cuting up all streaming resources into pieces of 8/16/32 KB. The latter makes defragmentation a no-brainer (all that needs to be done is fill holes with a simple memcpy), and is what Naughty Dog uses for streaming on the PS3, from what I know.

      3) You’re absolutely right about using some kind of handle or smart-pointer. In Molecule, I don’t return raw-pointers to resources, but use another level of indirection instead, which makes it possible to track dangling pointers and other problems I’ve encountered in the past. Especially dangling pointers led to a lot of crashes in completely unrelated code, because memory was overwritten someplace else.
      Furthermore, Molecule supports hot-reloading for every kind of resources in the running game, and handles make it far easier to support something like that.
      I didn’t want to clutter the example with details like handles, and so used raw-pointers instead.

      4) I’ve used std::vectors in the example only, again to make the code clearer for most of the readers, I hope. You’re right that this wastes memory, but that is something I wouldn’t really be concerned about much (it’s a few bytes, but it can sum up). What for me is even more important is that a std::vector can grow, which is something I absolutely don’t want to happen anywhere in my run-time code.

      I don’t use any STL container in the run-time code – no std::string, no std::vector, no std::map. One of the reasons is given above, but the other important thing people tend to forget about is that you cannot move them in memory, and you can’t directly memcpy into them, eventhough it might work on a particular STL implementation, but those are different across platforms. So a blob or other data structure is the way to go. Using plain old arrays in the example above would be a perfectly valid approach.

      I hope that clears things up, please let me know if there’s further questions!

      • Wow thanks for clarifying these points so in detail!
        1-2) So you are separating the management of static, dynamic and streamed meshes resources? The entirely static mesh management you only need to support a other kind of game genre aside of the big ones which needs streaming or I’m missing smth?
        In our engine we combine both approaches (static world and streaming) by using paging. So if you don’t need a huge world you only create one “world page” and thats it. To avoid the memory fragmentation we also use free-lists and pools in each subsystem which hold the actual components.
        Could you please explain the idea of splitting up resources into chunks a bit more?

        4) I think it really depends on the situation. I also strictly banned most of the STL containers like lists and so on, but I think there are situations where you won’t be able to pre-estimate the size of e.g. your mesh pools. Again there comes the example of “streaming of world pages” into my mind.
        Certainly one could only work with fixed size arrays driven by a config file and change this from game to game, but this isn’t very comfortable for me.
        I also recently switched to the EASTL which have some nice hiperf containers on board (like the fixed containers).

      • 1-2) Yes, they are separate. But they still can live in memory next to each other, the only difference is that their allocation scheme differs. Molecule’s memory system (explained in other posts) makes it really easy to e.g. say:

        HeapArea meshHeap(64*1024*1024);
        LinearArena staticMeshArena(meshHeap, whatEverSizeYouWant);
        PoolArena dynamicMeshArena(meshHeap.GetMemory() + whatEverSizeYouWant);

        And then you just allocate from whichever arena you want. Which allocators/arenas are used is completely up to the user, you just hand those to the engine classes. So you could use pools for both if you want, or allocate in an entirely different way if you need.

        About streaming, IMO only the *really* big games (GTA, RDR, Uncharted, and the likes) need a purely streamed world. And there’s many of reasons to still have static meshes in the engine, like e.g. things which are turned on/off during game-play, which could live in memory, be loaded once, and rendered at a later point in time. Furthermore, the decision of either using a purely streamed world, or throwing in a loading-screen in-between levels is up to the user (engine clients).

        4) You’re right, there are situations where you cannot pre-estimate the size of some allocation. Yet many times, you can work around that by using a slightly different allocator. Let me try to explain this with a simple example:

        Assume our game needs to load some assets which it needs all the time (menu screens, GUI textures, whatever). These have application lifetime. Furthermore, the game is strictly separated into levels, so each level loads its resources, and frees them whenever the player leaves the level. These resources have level-lifetime.
        One possible allocation strategy would be to have two linear allocators, each of them using a fixed size, e.g. having two LinearArena allocators. The problem with that approach is that as soon as you need more memory in either of the two, they need to be adjusted.
        However, if you were to use a stack-based allocator instead, you could define it’s size once, and both application and level-lifetime allocations use that allocator. Then it doesn’t matter so much which one of the two allocations need more memory, as long as they fit into the fixed budget – and especially in console games, you *have* to deal with fixed budgets, otherwise you’ll hunt down fragmentation and other kinds of memory problems at the end of the project. Been there, done that. Never gonna do it again.

        Even if you cannot estimate a fixed size for some allocations, you need to define an upper bound, e.g. you cannot just say “stream everything” without having defined how much memory there is to spare for the whole streaming system. So define an upper bound first, and carefully think about the allocation scheme second. That has worked really well for me, but your mileage may vary.

        I cannot state that enough, but if you *really* think about memory allocations carefully, you very seldomly need a completely dynamic and general-purpose memory allocator. IIRC, all Insomniac Games have clearly defined memory budgets in which certain allocations must fit – and these games are by no means small games.

        In addition, keep in mind that every console has extra memory for development which you can always use as a fallback for failed memory allocations (in conjunction with a big, fat error message), which makes it more practical to keep allocations fixed-size, and adjust their size every now and then in the config-file. I would suggest using a similar scheme on PC as well (I do).

      • About splitting up resources into chunks: The idea is to split *all* resources into equally-sized chunks, e.g. 512kB. This means that you can use a simple pool-allocator, and never worry about memory fragmentation when loading/unloading resources in the game.

        However, this means that all resource data needs to be laid out in a manner that permits division into such equally-sized chunks. This in turn means that you cannot have any resource which spans more than 512kB of contiguous space in memory, which could be hard to achieve if you’re not careful.

        The benefit of such a scheme is a) not having to worry about fragmentation and b) exceptionally fast streaming if the chunk-size is a multiple of the OS’s I/O size/media sector size.

        Personally, having to divide everything into 512kB chunks sounds like a risky drawback to me, but Naughty Dog did Uncharted that way, so…

      • First thank you for the in-deep explanation of this heap management strategy and I think I’ll definitively improve ours in this way. I’ve read a similar strategy in the book “Game Engine Architecture” from Jason Gregory.
        The benefit of using a config file to specify memory budgets explicitly is to also support consoles and IMHO the PC would also be more happy with such a memory layout.

        Using resource chunks to optimally use these allocators is a great idea, but I’m also a bit scary with this approach. But let’s try it out! 😉

      • You’re welcome!

        Neither the linear, stack nor double-ended stack allocators are something entirely new, games already used these allocation strategies back in the olden days of PS1/PS2/Xbox/etc. But it’s nice that Jason Gregory explains them in his book – my first contact with those allocators was when I started working on my first console title. Coming from the PC, I’ve never heard about those allocators before :). And yes, the PC *is* more happy with such a memory layout.

        You can find more details about the resource chunks scheme in Jason’s book, by the way.

  2. While I’m certainly not disputing the points here – doing things in groups is better than not doing them in groups, virtual function calls can hurt your performance, etc. – let me get this straight.

    You got a 1.5 ms boost for 500 meshes of 3 submeshes each, on PC. Assuming a 2 GHz core, it’s 3M cycles for 1500 submeshes, or 2K cycles of overhead for each submesh.

    The actual overhead that I can see from the provided code is:
    – virtual function call overhead (indirect call, basically – cache misses in this case disappear after the first call, and even branch prediction may work)
    – additional cache misses – i.e. for ‘nothing is visible’ case we have ~1 cache miss for each submesh in the original case (64b cache line) and ~1/2 cache misses for each submesh in the optimized case
    – some function call overhead, maybe some setup overhead in frustum culling function (this is unlikely though)

    I don’t see how these combined would account for 2K cycles of overhead for each submesh. Even on a console this translates to 3K cycles of overhead per submesh which is something like 5 cache misses – I can’t see these either; in fact, for 3 submeshes per mesh the data cache miss count OTOH does not seem too different over 500 meshes in these two cases.

    Is the actual code very different from the examples here, or am I missing something?

    • A very good and interesting question!

      One of the differences between the two given examples is that the first one stores pointers to sub-meshes, while the later stores them contiguously (a std::vector of SubMesh* in the first case, and a std::vector of SubMesh in the second case).

      For the experiments I did, each SubMesh* was allocated independently, not necessarily living next to each other in memory. This causes a lot more cache misses, and I admit that it is not completely clear from my post.

      Furthermore, the virtual function call overhead only disappears if we only ever deal with Mesh*, but in a design like the given one, most of the time you will hold an array to a certain number of IRenderables, and render them in order, causing a lot more cache misses in turn. This could be somewhat alleviated by sorting the array by type, but I’ve never seen anybody do this when dealing with base class pointers only. To clear that up, in the profiling I did, I’ve actually used more classes than just Mesh and SubMesh, which caused a lot more virtual function overhead – maybe that wasn’t stated clearly enough.

      Thanks for asking, I’m sorry I didn’t explain those two things more thoroughly in the post.

  3. Question, why are you using 4 separate vectors to store the bouncing boxes, materials, sub-meshes and visibility flags for the Mesh, would you not have cache-misses, since the data is dynamically stored and not stored linearly, or am I mistaken? i.e. Instead of:

    std::vector m_boundingBoxes;
    std::vector m_materials;
    std::vector m_subMeshes;
    std::vector m_visibility;

    Wouldn’t this be better?

    struct ImplMeshData
    {
    float* boundingBoxes; // vector?
    bool isVisible;
    Material* material;
    SubMesh subMesh;
    // padding, if required
    };

    std::vector m_data;

    • Having them in separate vectors (or preferably a contiguous block of memory) only pulls in the data you actually need, so you have less cache misses than with the usual AoS (array-of-structures) approach.

      If you only need the bounding boxes for culling, and store all the meshes’ data in a vector like in your example, you’re fetching all the other data (isVisible, material, subMesh) from memory, even though you don’t require it. Remember that a CPU always reads in cache-lines, so you pull in lots of unused data.

      On the other hand, if you store them separately, you only pull in the data you need for your specific use case. Culling would only touch the bounding boxes and the visibility flags, nothing more. Rendering would only touch materials and submeshes. That’s the difference between a AoS and SoA (structure-of-arrays) approach.

      Note that in practice I would use arrays and allocate the memory for all of them contiguously, but I didn’t want to clutter the post with such details, but keep concentrating on the main points.

      • sorry, I know this was posted last year, but I think in the example code you should actually put mesh properties together instead of separating them to different arrays. Say most of objects in your scene are visible, then every time m_visibility[i] is true, memory fetch needs to jump from m_visibility array to m_materials array and after that to m_subMeshes array. So I’m a bit confused. Could clarify on that?

      • The jumping around is not the problem, accessing data you aren’t going to need is.
        Let’s say you fetch the first entry from m_visibility, then 64 bytes (assuming m_visibility is aligned to a cache line, and the cache line size is 64) will be read into the cache.
        When we read m_materials afterwards, these 64 bytes will be read to a different location in the cache. Those don’t overwrite what we previously read.

        That’s a simplification, but basically the gist of it. For more details, please refer to Ulrich Drepper’s “What Every Programmer Should Know About Memory”.

  4. Could you explain a little more about how your data oriented Mesh class might interoperate with a custom allocator?

    for example, your Mesh class has a number of std::vector’s. In production code what kind of allocator would they use.

    If they used a linear allocator, wouldn’t that eat up a lot of memory the moment the vector needed to grow? Or do you use staticly sized arrays in real production code?

    do you use one (linear) allocator for all your static meshes per level or do you use different allocators for the different data in a mesh?

    • In production code I don’t use vectors, but my own container class(es). In this case a resizable (but not automatically growing) array. The static meshes are all allocated using the same allocator, assuming that those meshes stay during the whole level. If so, then they use the level allocator which is either a linear or stack-like allocator.

      Generally speaking, I tend to group allocations based on their lifetime and frequency. For statically sized level allocations I would use the same allocator for all things that need to be in the level at all times. Dynamically allocated assets (spawning units, dynamic meshes, particles, etc.) would use a different allocator.

      More general allocator strategies (when to use which allocator) is discussed in more detail in my slides from the Game Connection Paris master class, which are also available on this blog if you’re interested.

      • So you would not use a vector like array for dynamic (very much growing) arrays?

        Instead use a list of pointers to objects instead and allocate those in a pool allocator for example?

        in your example ‘std::vector m_boundingBoxes’ doesn’t really grow at all, it’s loaded once and never grown afterwards, right?

      • > So you would not use a vector like array for dynamic (very much growing) arrays?
        > Instead use a list of pointers to objects instead and allocate those in a pool allocator for example?

        vectors mostly have the disadvantage that you temporarily need more memory just for growing. E.g. if you have N elements and grow the vector, you temporarily need space for 3N (N+2N) elements, eventhough you only use 2N (assuming a vector grows by doubling its size). There’s other alternatives which could be better suited, e.g. making use of the virtual memory system which means you only ever need 2N but never 3N memory.
        If you don’t need the elements to be contiguous from start to finish, a growing pool allocator is also a good alternative. It really depends on the use case: frequency of insertion, deletion, and type of memory access. If you want O(1) insertion, deletion, and linear, contiguous memory access, deleting elements by swapping them with the last element of the array is a useful trick. This works like a charm if you handle those arrays internally and never return references to the outside world. If you want to do that, you could always use an ID or handle-based system on top for external references.
        But I digress… generally speaking there’s lots of alternatives.

        > in your example ‘std::vector m_boundingBoxes’ doesn’t really grow at all, it’s loaded once and never grown afterwards, right?

        Exactly. Don’t use a std::vector if there’s no need to grow. I’ve seen far too many cases where people use a std::vector, don’t call resize() or reserve(), and just use push_back() eventhough they know how many elements the vector will contain. Please don’t do that :).

  5. Line 12 of the “by-the-book” OOP example currently reads `m_isVisible = frustum->IsVisible(aabb);`. Should this not be `m_isVisible = frustum->IsVisible(m_boundingBox);`?

    Excellent articles; I’m reading through all of them!

  6. My question is the following. In your Data friendly code you have:

    std::vector m_boundingBoxes;

    So I assume you have the min x,y,z coordinates of the bounding box followed by the max x,y,z coordinates of the same box. Then the coordinates of the next submesh’s bounding box etc. etc.

    Could you not also have:

    std::vector m_boundingBoxes;

    Would this not also be the same in terms of being cache friendly?

    • Ok the comment for some reason messed up I meant:

      std::vector [float] m_boundingBoxes;

      replace this with

      std::vector [AABB] m_boundingBoxes;

      ie. use the AABB struct as opposed to just floats. Would this not still be equivalent in terms of the cache.

      • Template brackets are constantly messed up by WordPress ;).

        In this case, yes, it would be as efficient as just storing floats, because we access all of them in a linear fashion anyway. I just did not want to introduce additional types in the example.

        However, storing data in AOS vs. SOA could make a difference, in general.

  7. One more question,

    I see how you have changed the submesh references from AOS to SOA in the mesh structure. So now when you reference the m_visibility parameter you pull in the adjacent visibility parameters for the mesh into the cache line. So subsequent accesses to the parameter are now in the cache.

    But what about if you have several mesh structs in an array for example? And not only that each individual mesh struct has a varying number of submeshes?

    In this case you cannot guarantee that the m_visibility parameter data will be contiguous across all the meshes in the array. How do you guarantee that? Allocating a fixed number of meshes won’t help you because internally the number of submeshes varies. You could say have fixed amounts of memory for submesh storage in the mesh but again this will not get you contiguous memory for submesh parameters across several meshes. ie. the submesh m_visibility data won’t be in contiguous regions of memory from one mesh to the next, it will only be contiguous within the mesh instance.

    Or do you draw the line somewhere and realize NOTHING can be perfectly contiguous. So you just concentrate in this case on contiguous parameters ONLY within EACH mesh as opposed to across all meshes?

    • First, sorry for the long delay. I was on vacation, and am currently battling jetlag and the flu.

      Second, there are two options I can think of:

      • You can store the visibility array for all submeshes, and forget about meshes for a moment. If you’re doing submesh culling, then lay out the data as appropriate. Meshes could then simply be an index (and count ) into the submesh array.
      • Don’t create a separate visibility array at all, but rather store just the indices of visible submeshes. When iterating, use the indices to work on the corresponding submeshes. I wrote the code using the visibility array because I thought it would be easily understood by most of my readers, and because it doesn’t stray too far from the OOP example.

      But yes, all in all, you have to draw the line somewhere. There will be cache misses, some of them are unavoidable. Just try to lay out the data as good as possible for the task you’re doing.

      • No problem. I knew you would respond eventually. And thanks for answering my questions.

        One thing about structures of array data (SOA) it is important in making decisions to lay out the data for maximum cache efficiency. For example:

        Assume you are going to interpolate a translation vector and a rotation quaternion.

        I assume it is best to have an array of translation data and an array of orientation data like so:

        Vector4 Translations [NUM_TRANSFORM_COMPONENTS];
        Quaternion Rotations [NUM_TRANSFORM_COMPONENTS];

        The above is the best layout because the Translations and Rotations will be placed in separate cache lines when they are accessed. I assume that with this scenario the maximum allowable amount of data that you are processing will be in placed in the 2 separate cache lines.

        The other way of laying this out would be perhaps (AOS)

        struct sTransform
        {
        Vector4 Translation;
        Quaternion Rotation;
        };

        sTransform TRANSFORMS [NUM_TRANSFORM_COMPONENTS];

        The problem with this setup is that when you iterate the TRANSFORMS array of structs, you will pull in the Translation and Rotation data into the SAME cache line.

        While it is true that you will be using both elements for your interpolation, with the SOA example you will pull in MORE data into SEPARATE cache lines. Therefore with the SOA example more data will be available to you in the cache since it is stored in separate cache lines as opposed to being stored in a single\shared cache line with the AOS example.

        Am I correct in my assumption or completely off in the understanding of the cache and how it operates?

      • Not exactly. It’s not so much about pulling data into different cache lines, but more about how to effectively make the most of the cache capacity available to you. It’s better to touch data which is already in the cache, than fetching data from memory.

        Consider your example under the assumption that we are going to touch both Translations and Rotations. Let us further assume that NUM_TRANSFORM_COMPONENTS is 64, and that both the Vector4 and Quaternion store 4 floats, hence 16 bytes. Then it doesn’t matter whether we store the data in AOS or SOA fashion. Why?
        Because we always touch all data, 64*16*2 = 2048 bytes.

        In AOS fashion, we touch 64 sTransform components, each one containing a Translation and a Rotation. Assuming an architecture with a cache line size of 64 bytes, this means that each cache line contains: 16 byte Vector4 Translation, 16 byte Quaternion Rotation, followed again by 16 bytes Vector4, and 16 bytes Quaternion. So we load 2048 bytes worth of data, and use all 2048 of them.

        In SOA fashion, we touch 64 Translations, and 64 Quaternions, but the outcome is the same. We pull in 2048 bytes into our caches, and actually need them all.

        However, consider an algorithm were we only need the translation data, but not the rotation data. Then AOS vs. SOA makes a difference:
        In the AOS case, even though the C++ code only touches sTransform.Translation, the caches will still be loaded with sTransform.Rotation data, because it follows the translation data in the C++ memory layout of the struct. This means that for getting stuff done, we only need 1024 bytes of translation data, but still load the caches with 2048 bytes. This means that 50% of our cache’s capacity are wasted, and we waited for data to arrive in the cache which we never needed in the first place.

        In the SOA case however, we only pull in the 1024 bytes we actually need. This means that the we used the cache’s capacity as best as we could, and we only needed to wait for 1024 bytes to arrive in the cache – and not 2048.

        This is where the difference in performance comes from.

      • Thanks for the explanation. I think I now know where I went wrong. In SOA the cache line would be:

        [translation] [translation] [translation] [translation]

        and in another separate cache line we will have:

        [quaternion][quaternion][quaternion][quaternion]

        In AOS it would be:

        [translation][quaternion][translation][quaternion]

        HOWEVER where I went wrong is assuming that in memory only a cache line’s worth of data (64 bytes) is loaded from memory into the cache. Is that the case or are multiple cache lines (each being 64 bytes) loaded when we reference memory that is not in the cache?

        Because if only 64 bytes is the only thing loaded when uncached memory is referenced I still think SOA will win over AOS. And here’s why. With SOA you will have 4 translation components in one cache line and 4 quaternion components in another cache line. So you will be able to process FOUR sTransform components using the SOA method before having to pull memory into the cache.

        With AOS since the translation and rotation are both in the same cache line, only TWO sTransform elements at a time can be processed before more data needs to be brought in the cache.

        Of course this assumes that memory is brought into the cache using the size of a cache line. I am assuming this is what is happening? If not then I don’t understand why the second loop where you are referencing every 16th element is 7 times slower if multiple cache lines are filled with data when you reference memory.

      • Memory is always brought into the cache in the size of a whole cache line. There are a few things worth noting though, because I’m not 100% sure if I understood you correctly:

        • Some CPU architectures (e.g. Intel chips) do automatic prefetching of memory you are likely going to need. On PowerPC hardware (e.g. last-gen consoles) you had to do that yourself. Of course, prefetching mostly helps when you access your data linearly, and not when you’re making huge jumps all over the place.
        • When accessing memory that is not in the cache, a whole cache line (e.g. 64 bytes) is brought into the cache. Note that of course there are several cache lines which are “in flight” at the same time, so you can have several KB of data in the cache, depending on how memory addresses map to cache addresses. This is known as Cache Associativity.

        For all the details, refer to Ulrich Drepper’s “What Every Programmer Should Know About Memory”.

  8. That’s a truly excellent post. The example is very enlightening. May I just ask you if theoretically, couldn’t we push the conceptual framework even further and instead of having a Mash class, having one single Meshes System?

    I mean, suppose you have a Meshes System, or Meshes Component, or whatever we want to call it (to avoid being unfairly accused of incurring in the singleton sin). Suppose we have 500 meshes. Then the Meshes system would hold a bunch of arrays like position[500], scale[500], aabb[500], etc to store the data that is respective to each mesh. Then when another system, say the Physics System, needs to interface with the meshes, it directly refers to the exact attributes within the Meshes System that are needed.

    The problem, of course, is how to handle submeshes. This become less trivial. I can think of a couple of solutions. Solution A is only partially DOP, although yet an improvement over OOP. For the submeshes one could use nested arrays, like the following. Suppose again that we have 500 meshes, each with 1 to 10 submeshes. And for the sake of simplicity suppose each only have 3 stored things: position, size and aabb. Then the Meshes System would have an array submeshes[500][3][1to10]. Therefore, the first index refers to which mesh we are talking about. The second index refers to which attribute we are talking about. And the third index refers to which submesh we are talking about. I say that this solution is less DOP because we would still have only one big array holding submeshes’ stuff.

    Solution B is more DOP. Instead of having only one big array for the submeshes, we would actually have one submesh-related array for each attribute. In this case, the Meshes System would have position[500], scale[500], aabb[500], but also submesh_position[500][1to10], submesh_scale[500][1to10] and submesh_aabb[500][1to10]. That way, we could refer directly to the specific attribute of the submeshes we want to interface with from another system. Also, data would be more linearly displayed into memory.

    What do you think of all this? I mean, specially the overall concept: I think that if we are not afraid of fully departing from OOP, we can accept the proliferation of “system classes”, like a PhysicsSystem, a MeshSystem, etc, and then just transform all info of the former objects into entries in arrays. Anyways, just thinking aloud after reading your very enjoyable article. Thanks for your time!

    • You can certainly store individual mesh data as SOA and let the systems handle access to it. Handling submeshes also wouldn’t be too difficult: use arrays for the actual submesh data, and let a Mesh hold offsets into the submeshes’ data arrays.
      At one point or another you have to access memory in a non-linear fashion. Best to find a tradeoff between performance and usability. It’s also the reason why I’m leaning more towards AoSoA-style data layouts for some things because it simplifies certain aspects of development – but that is going to be the topic of one of the next posts.

  9. Pingback: Linear vector-oriented programming | Drifting in Yellow

  10. I’m new to this DOP or DOD, still learning, but I’m starting to understand it by reading this post and reading other readers’ comments. I have 🙂 question too, maybe not practical or something but trying to see if I have correctly understood the cache memory and misses:
    In OOP style for example I would have simple class SimpleClass which has some primitive data x and y. For example:
    class SimpleClass {
    //**** Some of class’s methods. ****
    typex getX() { return m_x; }
    typey getY() { return m_y; }

    typex m_x;
    typey m_y;
    };

    By reading other posts about DOP, this can be easy translate to cache friendly class:
    class MyData {
    //**** Maybe some methods still don’t know :). ****
    vector m_xes;
    vector m_yes;
    };

    class SimpleClass {
    //**** Again those class’s methods. ****
    typex getX() { return MyData.m_xes[m_id]; }
    typey getY() { return MyData.m_yes[m_id]; }

    /* This would actually represent to real class data from the MyData struct. */
    size_t m_id;
    };

    Now let say we have simple algorithm:
    class Algorithms {
    void doSomethingOnAllData(const vector& obj) {
    // for all SimpleClass objects from obj do something.
    obj.getX().againSomething(obj.getY());
    }
    };

    // This is initialized somewhere.
    vector m_myObjects;

    I’m still learning so I have tried nothing from this yet. So I assume that this code from above will be cache friendly when doing that doSomethingOnAllData() algorithm. Will it still be cache friendly in case I do some sorting on m_myObjects vector of SimpleClasses, but not on those arrays in MyData? I assume that it will be still more cache friendlier then OOP design, but still not as optimal as it should be. I just wanted to see your explanation about this one.

    I assume it won’t be optimal as it should be because it will randomly iterate through MyData arrays. So the best case scenario is that the m_objects vector is aligned as arrays in MyData (ie. not sorting it at all 🙂 ). It all depends what will be put on the cached line. Will some of the fetched cache data be used or maybe not? By thinking, it seems that it won’t, so maybe this is not either optimal solution, and probably not practical?

    Thanks.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s