Schema-based entity-component data for fast iteration times

Serialization, reflection, and other mechanisms are often used for saving data in an editor or a tool like the asset pipeline, and then loading that data into the engine at run-time. This process is well-known, flexible, and allows us to store the data in any format conceivable. Still, all those techniques show certain weaknesses when it comes to keeping iteration times to an absolute minimum.

Before introducing the concept of so-called “component schemas” as an alternative to serialization- and reflection-based techniques, let us try to identify their deficiencies.

Serialization

Fundamentally, serialization is the process of converting the state of an object into a format that can be stored or transmitted, and later reconstructed, turning a bunch of bytes back into a valid object. In the land of by-the-book OOP in C++, serialization of objects can easily be achieved by using inheritance, virtual functions, and an (abstract) base class. We can simply tack the “this can be serialized” property onto objects of specific classes by deriving from a common base class that provides the serialization interface.

This solution could be implemented as in the following example:

class ISerializable
{
public:
  virtual ~ISerializable() {}

  // Reads or writes all members, depending on the serializer implementation.
  virtual void Serialize(ISerializer* serializer) = 0;
};

class MyComponent : public ISerializable
{
public:
  virtual void Serialize(ISerializer* serializer)
  {
    serializer->Serialize("color", m_color);
    serializer->Serialize("some float value", m_float);
    serializer->Serialize("some int value", m_int);
  }

private:
  Color m_color;
  float m_float;
  int m_int;
};

All that needs to be done in order to serialize something is to instantiate a particular serializer (e.g. a TextSerializer) and call Serialize() on all objects that implement the serialization interface; a short usage sketch follows the list below. This is a pretty common approach, which has a few nice properties:

  • Serialization works in both directions. Data can be read and written using the same method; there is no need to implement two separate versions such as Load() and Save().
  • The format the data is stored in depends entirely on the serializer implementation. Object data can be stored as a human-readable text file, a binary file, or any other file format.
  • Almost arbitrary data structures can be serialized, as long as members of non-built-in types such as Color implement the interface as well and can therefore serialize themselves.
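
To make the driving side concrete, here is a minimal usage sketch; TextSerializer, the file name, and the container of objects are assumptions for illustration, not part of the interface above:

#include <vector>

void SaveAll(const std::vector<ISerializable*>& serializableObjects)
{
  // TextSerializer is a hypothetical ISerializer implementation that
  // writes name/value pairs as human-readable text.
  TextSerializer serializer("savegame.txt");

  // Reading and writing use the exact same call; only the serializer
  // implementation decides the direction and the format.
  for (ISerializable* object : serializableObjects)
  {
    object->Serialize(&serializer);
  }
}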

Of course, this approach also comes with disadvantages:

  • The serialization process is buried in the code itself. For editing purposes, we often want to be able to define meaningful default, min, and max values for certain properties, and maybe provide the user with tooltips or similar pieces of information. In order to support that, we could add helper functions that let us specify additional data to be used by the serializer (a hypothetical example follows this list), but now the code is cluttered with information that is essentially only needed for editing purposes.
  • All classes that should be serialized need to inherit from a base class. For classes that would otherwise be PODs, this adds a virtual function table pointer to each instance, so instances can no longer safely be memcpy()’d around in memory. Classes that already use single inheritance are forced into multiple inheritance (ugh!), which we would rather avoid.
    Furthermore, it causes additional overhead when using objects in containers, because constructors now have to be called whenever objects are added or moved.
  • It creates coupling between the game/engine and the editing tools. What if the editor is written in a language other than C++ (which is likely)? The editor needs to know which properties can be edited, but doesn’t have access to the corresponding C++ code. One can work around that by using specially crafted serializers on the editor’s side, but that introduces additional unwanted complexity into the whole process.
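
To illustrate the first point, here is what such a helper overload might look like inside MyComponent’s Serialize() method; Default, Range, and Tooltip are made-up names:

virtual void Serialize(ISerializer* serializer)
{
  // Editor-only metadata (default, range, tooltip) now sits right next
  // to the run-time serialization code, cluttering it.
  serializer->Serialize("some float value", m_float,
    Default(0.5f), Range(0.0f, 1.0f),
    Tooltip("Blend factor between noise and the blurred result"));
}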

In general, the approach is not too shabby, and you can certainly ship games with it. However, the biggest gripe I have with it is something that is often referred to as being one of the advantages of such an approach:

“You only have to add one single line of C++ code in order to serialize new members, and make them appear in the editor! This is super-fast, and only takes a few seconds.”

In theory, yes.
In practice, not so much.

Speaking from experience, the process of adding a new member to the Serialize() method often goes like this:

  • For a certain feature in the game (e.g. a mini-game), the designer needs to be able to set the value of a new property in the editor. He politely asks the programmer if he could add that value “to the editor”.
  • The programmer responsible for the mini-game is currently working on something completely different, and can either ignore the designer’s request for now and do it at a later time, or shelve all of his current work and make the change.
  • Adding the single line of code takes about 10 seconds, but then a new executable needs to be built. Compiling a single file is fast, but linking can be really slow, sometimes in the order of minutes if the project is large enough.
  • Once the new executable is ready, it needs to be made available to the designer. This means either submitting the newest version to source control, or copying it to some location on the company’s intranet.

No matter how you twist and turn it, in reality that single-line code change can actually take 10 minutes or more, depending on the processes involved.

Reflection

Even though reflection itself is different from serialization, in terms of turnaround times it suffers from the same disadvantages as serialization-based mechanisms do. Because proper reflection abilities are not built into the C++ language, one has to come up with preprocessing-, macro-, or template-based approaches, all of which can be ugly, bloated, or both. Again, the code has to be touched by some tool or the compiler in order to reflect the newest changes (no pun intended). This again leads down the route of having to wait for more than 10 minutes because of a single code change.
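
As a rough illustration of what a macro-based approach can look like (all names here are made up, and real systems are considerably more involved):

#include <cstddef>

// Every member has to be registered by hand, and the class layout ends
// up duplicated in a second place that must be kept in sync.
struct MemberInfo
{
  const char* name;
  size_t offset;
};

#define REFLECT(Class, Member) { #Member, offsetof(Class, Member) }

struct MyComponent
{
  float m_float;
  int m_int;
};

const MemberInfo g_myComponentMembers[] =
{
  REFLECT(MyComponent, m_float),
  REFLECT(MyComponent, m_int),
};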

Introducing a schema-based approach

In order to turn this 10-minute process into something more reasonable, we have to cut out the middleman: no code changes required, no new executable, everything needs to be data-driven. If possible, we would like to have a system that:

  • Allows the designer to add new properties without having to bug a programmer.
  • Allows the team to use the same executable as before, gracefully handling the data that has been added by the designer/editor.
  • Allows the programmer to do the code changes whenever he’s ready.

In Molecule, I introduced the concept of component schemas, which are files that dictate what properties/fields a component consists of, what their types are, what their default values are, and so on. In a way, they can be compared to database layouts. As an example, this is what the schema of a MotionBlurComponent looks like:

Schema =
{
  name = "MotionBlurComponent"
}

Fields =
[
  {
    name = "shutterSpeed"
    type = "float"
    display = "Shutter speed"
  }
  {
    name = "centerWeight"
    type = "float"
    display = "Center tap weight"
  }
  {
    name = "maxBlurRadius"
    type = "int"
    display = "Max. blur radius"
  }
  {
    name = "noiseBlend"
    type = "float"
    display = "Amount of noise"
    min = 0.0f
    max = 1.0f
  }
  {
    name = "sampleCount"
    type = "intArray"
    display = "Number of samples per pass (2 passes)"
  }
  {
    name = "sampleOffsetScale"
    type = "floatArray"
    display = "Sample offset scale per pass (2 passes)"
  }
]

Schema files use the familiar format introduced in the last post, and contain the following information:

  • The name of the component.
  • An array of fields, where each field at least needs to have a name and a type data value.
  • Additionally, each field supports optional default, min, max, and display data values. A hypothetical field using all of them is shown below.
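
As a hypothetical example (the values are invented, not taken from the schema above), a field using all optional data values could look like this:

{
  name = "shutterSpeed"
  type = "float"
  default = 0.005f
  min = 0.001f
  max = 1.0f
  display = "Shutter speed"
}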

All the schema files are stored in the same directory, so the content pipeline simply loads all the files contained in that directory and stores the schemas in a schema dictionary. Whenever the content pipeline needs to build the binary run-time data for a component, it looks up the corresponding schema in the dictionary and builds a binary representation of the data according to the schema. The resulting data is stored in binary form, and additionally contains an array of hashes of all fields’ name values.
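
The exact binary format isn’t spelled out here, but one possible layout could look like the following sketch; the names and the exact arrangement are assumptions:

#include <cstdint>

// Header of the binary component data built by the content pipeline.
struct ComponentDataHeader
{
  uint32_t fieldCount;    // number of fields that follow
};

// In memory, the header would be followed by:
//   uint32_t fieldNameHashes[fieldCount]; // hash of each field's "name"
//   uint32_t fieldOffsets[fieldCount];    // byte offset of each value
// and finally the tightly packed field values themselves. The hash
// array is what allows the run-time to find the fields it expects and
// to skip fields it knows nothing about.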

At run-time, this allows the engine to deal with newly added data it knows nothing about: using a fast hash-based lookup, it simply picks out the data it expects and ignores the rest. Furthermore, it enables the engine to generically spawn components using a factory and techniques similar to the data translator introduced last time, e.g. some kind of data structure or utility class that is able to apply values to members of a certain class (using pointers-to-members). Of course, this is a trade-off between the flexibility offered by the schema-based approach and the binary loading ability offered by e.g. serialization-based approaches, but in this case it is a trade-off I am more than willing to accept. Compared to mesh data, texture data, animation data, and pretty much all other kinds of asset data, component data is usually very small, hence a bit of extra work upon loading won’t be noticed.
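
A minimal sketch of the pointer-to-member idea, assuming a simplified translator and an FNV-1a hash; none of these names are Molecule’s actual code:

#include <cstdint>

// Compile-time FNV-1a hash of a field name.
constexpr uint32_t Hash(const char* str, uint32_t hash = 2166136261u)
{
  return (*str == '\0') ? hash : Hash(str + 1, (hash ^ static_cast<uint32_t>(*str)) * 16777619u);
}

// The run-time struct the loaded values are applied to.
struct MotionBlurComponent
{
  float shutterSpeed;
  float centerWeight;
};

// Maps the hash of a schema field name to a pointer-to-member.
struct FloatField
{
  uint32_t nameHash;
  float MotionBlurComponent::* member;
};

const FloatField g_floatFields[] =
{
  { Hash("shutterSpeed"), &MotionBlurComponent::shutterSpeed },
  { Hash("centerWeight"), &MotionBlurComponent::centerWeight },
};

// Applies a loaded value to whichever member matches the hash;
// unknown hashes are simply ignored.
void ApplyFloat(MotionBlurComponent& component, uint32_t nameHash, float value)
{
  for (const FloatField& field : g_floatFields)
  {
    if (field.nameHash == nameHash)
    {
      component.*field.member = value;
    }
  }
}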

Nevertheless, in addition to the generic run-time loading system, the engine allows adding specialized classes (so-called Spawners) to the factory, which create a component directly from the binary representation without doing any hash-based lookups, making the process as fast as possible for components where such a spawner implementation is provided. This effectively lets the user choose between the generic approach that works all the time, and a specialized approach that just loads binary data, which might be useful if you have several thousand large-ish components. Of course, by adding such a spawner you lose the ability to change schema files on-the-fly without providing a new executable, because the binary representations will no longer match.
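
What such a spawner might look like, with ISpawner and the component types being assumed names rather than the engine’s actual interface:

#include <cstring>

// POD mirror of the schema fields; its layout must match the binary
// data produced by the content pipeline exactly.
struct MotionBlurSettings
{
  float shutterSpeed;
  float centerWeight;
  int maxBlurRadius;
};

class ISpawner
{
public:
  virtual ~ISpawner() {}
  virtual void* Spawn(const void* binaryData) = 0;
};

class MotionBlurSpawner : public ISpawner
{
public:
  virtual void* Spawn(const void* binaryData)
  {
    MotionBlurSettings* settings = new MotionBlurSettings;
    // No hash-based lookups at all: just copy the binary blob. This is
    // why a schema change invalidates the spawner until a new
    // executable is built.
    std::memcpy(settings, binaryData, sizeof(MotionBlurSettings));
    return settings;
  }
};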

In conjunction with a content pipeline that supports hot-reloading of data, this makes it possible for the editor to pick up changes to any schema file in real-time, updating the relevant GUI widgets without having to close either the editor or the engine. As an example, a designer could add a new AABB field to a schema, immediately see it exposed in the editor, edit it, and save it to a file without ever needing the help of a programmer. As soon as the programmer is able to work on that part of the code again, he immediately has access to the AABB data the designer already edited, and can concentrate on writing the code that actually makes use of that data.

And that concludes today’s post.
I’m curious: what do you use for component data? Serialization? Reflection? A data-driven approach? Let me know in the comments!

7 thoughts on “Schema-based entity-component data for fast iteration times”

  1. This is actually an excellent workflow. These “component schemas” in addition to the DataTranslator can really cut iteration time. I wonder if we can combine this with the ability to load binary data directly. So for development we would use component schemas to cut down iteration time, but for the final build we could convert all of the data to binary and just load it into memory (without the need for parsing).

    • Exactly!
      The workflow you mentioned is the reason why I put in support for specialized spawner classes. You can easily use a setup where during development (and generally in debug and release builds), component data is always read using the generic approach, but in master builds there need to be spawners registered, otherwise the data cannot be loaded.
      One thing to keep in mind though is that you have to make sure that the specialized spawner classes are always up-to-date during development, otherwise master builds won’t work, and probably nobody will notice for a long time :).

  2. Hi, just want to mention my project which does something similar:
    https://github.com/martinscheffler/qtentity
    The schema defines max values and other attributes. Components can have properties containing lists of stuff; the available types are defined in the schema.
    Components can be visualized in an editing widget, and they can also be streamed to disk or over the network.

  3. So basically you create a new “struct” component type MotionBlurComponent just by loading that schema? Or do you have the MotionBlurComponent struct/class already made in C++, and it just uses the schema to stream data in/out?

    • It depends.
      For stable engine components like the motion blur component, the struct is defined in C++ and the schema (in conjunction with the data translator) knows how to apply values to that struct. For in-progress components, data gets accessed by name (hashed strings) at run-time. In conjunction with run-time compiled C++ code, you can change the schema, change the code for accessing the data, and everything still works without having to restart anything :).

  4. I’m still going through your other blogs, but if I understand this correctly, when you expose values this way you’re having to search for them by name from code. So something like:

    myComponent["maxBlurRadius"] = 3.0;

    So does this remain in final release builds? Or do you convert these to a C++-side struct and make a pass through the code and change them to:

    myComponent.maxBlurRadius = 3.0;

    If you have the ability to runtime-recompile, can you not just have someone change the source code and add the new variable? Or are you unable to runtime-recompile enough of the engine classes?

    Another important issue that goes with serialization is versioning, so I’d be interested to hear how you handle any issues that come up with that as well.

    • I’m still going through your other blogs, but if I understand this correctly, when you expose values this way you’re having to search for them by name from code. So something like:

      myComponent["maxBlurRadius"] = 3.0;

      So does this remain in final release builds? Or do you convert these to a C++-side struct and make a pass through the code and change them to:

      myComponent.maxBlurRadius = 3.0;

      In development builds, the lookup from “maxBlurRadius” to the real value is only done once, and stored inside a C++ struct. You as a user only ever see the structs, and no string-based lookups. Internally, it is a hash-based lookup, and uses the technique described in my other post:
      https://molecularmusings.wordpress.com/2014/01/23/translating-a-human-readable-json-like-data-format-into-c-structs/

      In retail builds, the struct would be loaded directly from binary data, not doing any hash-based lookups at all.

      If you have the ability to runtime-recompile, can you not just have someone change the source code and add the new variable? Or are you unable to runtime-recompile enough of the engine classes?

      You could do that, but that would break in case you’re already using the binary data. Adding a new value and recompiling the code would mean that the struct layout is now no longer compatible to the binary data the content pipeline created.
      With the hash-based lookups used in development builds, runtime re-compiles would work.

      Another important issue that goes with serialization is versioning, so I’d be interested to hear how you handle any issues that come up with that as well.

      The content pipeline automatically versions the data, incrementing the version each time the schema has changed. The runtime code can take appropriate measures, if required.
