Even though Molecule’s run-time engine exclusively uses binary files without doing any parsing, the asset pipeline uses a human-readable non-binary format for storing pretty much everything except raw asset files like textures or models. This post explains the process behind translating data from such a human-readable format into actual instances of C++ structs with very little setup code required.
Why a human-readable non-binary format? Isn’t that slower to load?
A non-binary format is of course slower to load than a binary one, but it is only used for files that are being used by the editor and the asset pipeline. As such, those files are constantly changed, so they have to be easy to read, diff, and merge – and users should be able to spot errors in the file structure almost immediately.
Why not XML?
Much has been said about XML and XML vs. JSON already. To me, XML is a human-readable language, but not one that can be parsed easily by humans: I find it hard to follow larger pieces of XML just by looking at the contents, without any visual aid such as coloring or syntax highlighting. Other formats tend to be much easier to read, with less overhead and visual clutter.
Why not JSON?
If you take a look at the JSON example on Wikipedia, JSON is still a bit too verbose for my taste (I certainly don't want to put everything in quotation marks). Hence, I use a slightly altered format that is unambiguous to parse, supports objects as well as arrays, and is easily understood just by looking at an example.
Why not use an existing parser?
We programmers really don't like re-inventing the wheel, especially in tools code, where there is usually a bit more leeway regarding things like the number of memory allocations made. Still, I don't want to make a hundred calls to new and delete just to parse a measly options file.
Sadly, that rules out most parsers already. Most of them tend to have individual classes for objects, arrays, values of different kinds, attributes, and more, calling new all over the place every time they encounter a new value, stuffing it into a big tree-like data-structure like a map.
I would like to have a parser that reads the file exactly once, parses in-place, uses no dynamic string allocations for parsing (the strings are already there, no need to create tons of std::string!), and puts everything into a somewhat generic data structure, holding values of objects and arrays inside a simple array. Additionally, I would like to go from this data structure to any C++ struct I like, and make the translation process as easy as possible.
Molecule’s data format
Let’s start with a simple example file that shows almost everything that can be put into a data file:
AnyObject =
{
    # this is a comment
    stringValue = "any string"
    floatValue = 10.0f
    integerValue = -20
    boolValue = true
    intArray = [ 10, 20, 30 ]
    stringArray = [ "str1", "str2", "str3" ]
    aNestedObject =
    {
        name = "object1"
        someValue = 1.0f
    }
}
I would argue that just by looking at the contents of this file you can immediately parse all the information, without the need for any aid like syntax highlighting, color-coded keywords, etc.
Here is a more complete example: an options file specifying shader options with which to build a pixel shader for applying deferred point lights.
Options =
{
    # general
    warningsAsErrors = false

    # debugging
    generateDebugInfo = false

    # flow control
    avoidFlowControl = false
    preferFlowControl = false

    # optimization
    optimizationLevel = 3
    partialPrecision = true
    skipOptimization = false
}

# array of defines with their names and values
Defines =
[
    # toggles between different PCF kernels (2x2, 3x3, 5x5, 7x7). 0 means no shadow mapping.
    {
        name = "ME_SHADOW_MAP_PCF_KERNEL"
        values = [0, 2, 3, 5, 7]
    }
]
Parsing
Parsing is quite simple, really.
The general rules are:
- Every time a ‘{‘ is encountered, the definition of a new object starts. ‘}’ closes the definition.
- Every time a ‘[‘ is encountered, the definition of a new array starts. ‘]’ closes the definition.
- Every time a ‘=’ is encountered and there are non-whitespaces to the right, add a new value to the current object or array.
- Ignore everything after a ‘#’.
The parser essentially just parses the whole file line-by-line, and keeps track of its current state. Individual lines and values are parsed by using fixed-size strings on the stack. Disambiguating different value types is also straightforward if you try to identify types in the following order:
1. If the value starts with a '"', it must be a string. Else, go to 2.
2. If the value ends with an 'f' or contains a '.', it must be a float, because we ruled out strings already. Else, go to 3.
3. If the value starts with either 't' (true) or 'f' (false), it must be a bool. Else, go to 4.
4. The type is an integer.
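As an illustration only, here is a minimal sketch of what that decision could look like for a value that has already been extracted into a character buffer. The names DetermineValueType and ValueType are made up for this example; the real parser works on fixed-size strings on the stack.

#include <cstddef>    // size_t
#include <cstring>    // std::memchr

// made-up names, for illustration only
enum class ValueType { String, Float, Bool, Int };

ValueType DetermineValueType(const char* value, size_t length)
{
    // 1) strings always start with a quotation mark
    if (value[0] == '"')
        return ValueType::String;

    // 2) floats either end with an 'f' or contain a '.' (strings have been ruled out already)
    if (value[length - 1u] == 'f' || std::memchr(value, '.', length) != nullptr)
        return ValueType::Float;

    // 3) booleans start with 't' (true) or 'f' (false)
    if (value[0] == 't' || value[0] == 'f')
        return ValueType::Bool;

    // 4) everything else is an integer
    return ValueType::Int;
}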
Implementing the parser yourself also has the added benefit that error checking can be made a bit more robust with meaningful error messages. For example, telling the user that the parser “encountered an error in line 8” isn’t very helpful, and we can do so much better than that.
How does “Malformed data: Array was opened without preceding assignment in line 8. Did you forget a ‘=’?” sound? Much better.
All in all, the parser weighs in at around 300 lines of C++ code, including comments, asserts, and error messages.
Generic data structure
Parsing is one thing, but how do we hold the values in memory? I settled on a straightforward implementation, using the following classes:
- DataBin: holds any number of DataObject and DataArray (both stored in an array)
- DataObject: holds any number of DataArray and DataValue (both stored in an array)
- DataArray: holds values or objects (stored in an array)
- DataValue: can either be a string, a float, an integer, or a boolean value (stored using a union)
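Just to give a rough idea, the following sketch shows what these classes could look like. The member names and exact layout are assumptions; only the overall relationships follow the description above.

struct DataObject;    // forward declaration - arrays may hold nested objects

struct DataValue
{
    enum class Type { String, Float, Int, Bool };

    Type type;
    const char* name;          // name of the value inside its object
    union
    {
        const char* asString;  // string data copied into memory owned by the linear allocator
        float asFloat;
        int asInt;
        bool asBool;
    };
};

struct DataArray
{
    const char* name;
    DataValue* values;         // either plain values...
    unsigned int valueCount;
    DataObject* objects;       // ...or nested objects
    unsigned int objectCount;
};

struct DataObject
{
    const char* name;
    DataValue* values;
    unsigned int valueCount;
    DataArray* arrays;
    unsigned int arrayCount;
};

struct DataBin
{
    DataObject* objects;
    unsigned int objectCount;
    DataArray* arrays;
    unsigned int arrayCount;
};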
Memory for the class instances is simply allocated using a linear allocator, so the data for a whole object or array is always contiguous in memory. This creates no fragmentation; you can allocate a reasonably sized buffer once and use it for all parsing operations (resetting the allocator after parsing a file has finished). It also helps with the next step: translating the data into C++ structs.
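For reference, a bump-pointer allocator with such a reset operation can be sketched in a few lines. This is only a sketch assuming a caller-provided, fixed-size buffer and power-of-two alignments; Molecule's actual allocator interface may look different.

#include <cstddef>    // size_t
#include <cstdint>    // uintptr_t

class LinearAllocator
{
public:
    LinearAllocator(void* buffer, size_t size)
        : m_start(static_cast<char*>(buffer))
        , m_current(m_start)
        , m_end(m_start + size)
    {
    }

    // alignment must be a power of two
    void* Allocate(size_t size, size_t alignment)
    {
        // align the current pointer upwards, then bump it past the allocation
        const uintptr_t current = reinterpret_cast<uintptr_t>(m_current);
        const uintptr_t aligned = (current + alignment - 1u) & ~(alignment - 1u);
        char* const userPointer = reinterpret_cast<char*>(aligned);
        if (userPointer + size > m_end)
            return nullptr;        // buffer exhausted - it was sized too small

        m_current = userPointer + size;
        return userPointer;
    }

    // throws away all allocations at once, e.g. after a file has been parsed
    void Reset(void)
    {
        m_current = m_start;
    }

private:
    char* m_start;
    char* m_current;
    char* m_end;
};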
Translating generic data into C++ structs
What do I mean by translation? We don’t want to access individual pieces of data using a name-based lookup as in the following example:
const bool value = dataBin.GetValue("Options/warningsAsErrors");
There are several reasons why I try to stay away from such an approach:
- It is error-prone. Every time you want to access a value, you risk misspelling it, and accesses are probably going to happen from several different .cpp files, making it harder to find the culprit.
- Every time somebody wants to access a value, it has to be retrieved from the generic data structure, which is some kind of search operation. That can be sped up by using hashes, binary search, or similar – but why do it on each access operation if we don’t have to?
- We want to be able to pass objects around to other functions. In many cases, we want to throw away the contents of a file and close the file handle in the meantime, and only keep the parts we need.
What we want is something like the following:
struct PixelShaderOptions
{
    bool warningsAsErrors;
    bool generateDebugInfo;
    bool avoidFlowControl;
    bool preferFlowControl;
    int optimizationLevel;
    bool partialPrecision;
    bool skipOptimization;
};

// assume dataBin is filled with generic data by the parser
DataBin dataBin;
PixelShaderOptions options;

// translates the object named "Options" from dataBin, putting all data into the given struct according to the translator
TranslateObject("Options", dataBin, translator, &options);
After the data has been translated, we can simply access it via e.g. options.optimizationLevel. We can pass it around to other functions and throw away the file in the meantime. Accessing values is fast, and checked at compile-time.
The remaining question is: how do we build such a generic translator using standard, portable C++? What info does the translator hold? What we need is a list that specifies which value goes into which struct member. In our case, struct members can be std::string, std::vector, or other non-POD types, so using offsetof is not an option.
C++ Pointer-to-member
Pointers-to-members are one of those C++ features that only get used once in a blue moon. Personally, this was the second time I have used pointers-to-members in the last ten years of C++ programming.
What exactly is a pointer-to-member? In layman's terms, a pointer-to-member in C++ allows you to refer to non-static members of class objects in a generic way, which means that you can e.g. store a pointer to a std::vector member, and assign values to that member of any class instance using the pointer.
Similar to pointers-to-member-functions, these pointers also exhibit awkward syntax, and in order to access a member of a class instance through such a pointer, you have to use either the .* or the ->* operator.
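To illustrate the syntax, here is a small, self-contained example using a stripped-down version of the PixelShaderOptions struct introduced above:

struct PixelShaderOptions
{
    bool warningsAsErrors;
    int optimizationLevel;
};

int main(void)
{
    // a pointer-to-member of type int, pointing at PixelShaderOptions::optimizationLevel
    int PixelShaderOptions::*member = &PixelShaderOptions::optimizationLevel;

    PixelShaderOptions options = {};
    PixelShaderOptions* pointerToOptions = &options;

    // access the member through an instance (.*) or through a pointer to an instance (->*)
    options.*member = 3;
    pointerToOptions->*member = 3;

    return 0;
}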
Using pointers-to-members, we can store a list of the names of values that we want to translate, along with their pointer-to-member. That is, we could do something like the following:
DataTranslator<PixelShaderOptions> translator;
translator.Add("warningsAsErrors", &PixelShaderOptions::warningsAsErrors);
translator.Add("optimizationLevel", &PixelShaderOptions::optimizationLevel);
// and so on...
The DataTranslator is a simple class template that holds pointers-to-members of a certain type, as shown in the following example:
template <class T>
class DataTranslator
{
    // single members
    typedef bool T::*BoolMember;
    typedef int T::*IntMember;
    typedef float T::*FloatMember;
    typedef std::string T::*StringMember;

    // array members
    typedef std::vector<bool> T::*BoolArrayMember;
    typedef std::vector<int> T::*IntArrayMember;
    typedef std::vector<float> T::*FloatArrayMember;
    typedef std::vector<std::string> T::*StringArrayMember;

public:
    DataTranslator& Add(const char* name, BoolMember member);
    DataTranslator& Add(const char* name, IntMember member);
    DataTranslator& Add(const char* name, FloatMember member);

    // other overloads omitted
};
For storing the pointers-to-members internally, we can either use one array containing some kind of pointer-to-member variant (which we would have to build ourselves first), or use separate arrays for storing the pointers to different types. In Molecule, I chose the latter approach because it makes data translation faster – it only touches the data it needs for translating a member of a certain type. For example, when translating an integer value, we don't need to look at all the other members, but only at the pointers-to-int members.
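A sketch of what that per-type storage could look like is shown below. It repeats only the bool and int parts of the interface from above, and the container choice, the entry structs, and the FNV-1a stand-in for the engine's string hashing are all assumptions:

#include <cstdint>
#include <vector>

struct DataObject;    // from the generic data structure sketched earlier

// simple FNV-1a hash, used here as a stand-in for the engine's string hashing
static uint32_t HashString(const char* str)
{
    uint32_t hash = 2166136261u;
    for (; *str != '\0'; ++str)
        hash = (hash ^ static_cast<uint32_t>(*str)) * 16777619u;
    return hash;
}

template <class T>
class DataTranslator
{
    typedef bool T::*BoolMember;
    typedef int T::*IntMember;

    struct BoolEntry { uint32_t nameHash; BoolMember member; };
    struct IntEntry  { uint32_t nameHash; IntMember member; };

public:
    DataTranslator& Add(const char* name, BoolMember member)
    {
        m_boolMembers.push_back(BoolEntry { HashString(name), member });
        return *this;
    }

    DataTranslator& Add(const char* name, IntMember member)
    {
        m_intMembers.push_back(IntEntry { HashString(name), member });
        return *this;
    }

    // translation helper, sketched further below
    void Translate(const DataObject& object, T* instance) const;

private:
    // separate arrays per type: translating an int only ever touches m_intMembers
    std::vector<BoolEntry> m_boolMembers;
    std::vector<IntEntry> m_intMembers;
    // float, string and array members would be stored in additional vectors, omitted here
};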
One additional common trick we can use in the DataTranslator interface is to return a reference to ourselves in each Add() method. This allows us to define static, immutable translators like in the following example:
const DataTranslator<PixelShaderOptions> g_pixelShaderOptionsTranslator = DataTranslator<PixelShaderOptions>()
    .Add("warningsAsErrors", &PixelShaderOptions::warningsAsErrors)
    .Add("generateDebugInfo", &PixelShaderOptions::generateDebugInfo)
    .Add("avoidFlowControl", &PixelShaderOptions::avoidFlowControl)
    .Add("preferFlowControl", &PixelShaderOptions::preferFlowControl)
    .Add("optimizationLevel", &PixelShaderOptions::optimizationLevel)
    .Add("partialPrecision", &PixelShaderOptions::partialPrecision)
    .Add("skipOptimization", &PixelShaderOptions::skipOptimization);
Translating an object is now as simple as walking the DataObject that we are being given, and matching each DataValue stored in the DataObject with the corresponding pointer-to-member stored in the translator. Using some kind of hashing scheme such as quasi compile-time string hashes for the value names, we only have to search through our array of pointers-to-members corresponding to the type of the DataValue, which is a very fast operation.
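Sketched out on top of the structures and translator from above, and assuming the free TranslateObject() function simply looks up the DataObject by name in the DataBin and forwards to a member function like this one, the matching step could look as follows:

template <class T>
void DataTranslator<T>::Translate(const DataObject& object, T* instance) const
{
    for (unsigned int i = 0; i < object.valueCount; ++i)
    {
        const DataValue& value = object.values[i];
        const uint32_t nameHash = HashString(value.name);

        switch (value.type)
        {
            case DataValue::Type::Bool:
                // only the pointers-to-bool-members need to be searched
                for (const BoolEntry& entry : m_boolMembers)
                {
                    if (entry.nameHash == nameHash)
                        instance->*entry.member = value.asBool;    // assignment through the pointer-to-member
                }
                break;

            case DataValue::Type::Int:
                for (const IntEntry& entry : m_intMembers)
                {
                    if (entry.nameHash == nameHash)
                        instance->*entry.member = value.asInt;
                }
                break;

            // float, string and array values are handled analogously
            default:
                break;
        }
    }
}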
Conclusion
The data format introduced in this post is used all over the place in Molecule’s asset pipeline. It is used for asset compiler options, resource packages, components, entities, and schemas. Schemas are an interesting thing when used in conjunction with an entity-component-architecture, and will be the topic of upcoming posts.
I have a very similar file format in place, except that I don’t use ‘=’ signs at all.
Plus, instead of a specialized ‘allocator’, I just have a single ‘growing’ (starting with a reasonable size, 2048 chars, never shrinking) “token string” that can be used to parse everything.
If no ‘line’ is ‘longer’ than 2048, it will never ‘allocate’ from ‘heap’.
Anyway, I do feel like this approach is a very reasonable way to go 🙂
The specialized (linear) allocator is used for allocating the objects and arrays inside the DataBin, DataObject, and DataArray. All “strings” used for parsing are essentially nothing more than character arrays on the stack (I use a so-called FixedSizeString for that).
Yes, I never want to go back to XML :).
Sorry for arriving late to the game but TOML has existed for a while now:
https://github.com/mojombo/toml
Great post again, Stefan! How do you determine the number of bytes you need to allocate in the linear allocator? I checked your link and it seems your linear allocator cannot grow, so I wonder how you determine a reasonable upper bound for the required memory.
One option would be to make the linear allocator grow (similar to the growing stack-like allocator). However, this is tools/asset-pipeline code, so I allocate 1 MB once and be done with it :). The files are all small, 1 MB is more than enough.
I see! Do you recommend against storing vertex data in this format? If not, how should we store it? I see two options:
# Option 1 (nested)
Vertices =
[
V1 = [ 1.0, 2.0, 3.0 ]
V2 = [ 4.0, 5.0, 6.0 ]
…
]
# Option 2 (flat)
Vertices = [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, … ]
I am not necessarily thinking about models, but more about physics hulls and meshes where this might be useful – and these meshes can be large these days (at least during production). I think Collada had to make a similar decision, and for large scenes parsing Option 1 might take a hit due to heavy copying. The second approach would parse into a linear array which we could just cast to a float or vector array. It is kind of readability vs. performance. What is your take on this? How do you see this file format working for larger files?
If at all, I would definitely choose Option 2. However, I would generally try not to load huge amounts of data this way, but rather export it from the corresponding tool (Maya, ZBrush, …) in a binary format. But I understand that for certain things this means writing your own exporter, which can be tedious.
Where do you get the data from?
I export this data from Maya. My plug-in has a custom shape node which can be either a hull or mesh. Ideally I plan to export my custom nodes into this format and then use the Maya referencing to create these nodes myself when loading. This will allow me to make changes in other tools and will show up again inside Maya. I am heavily experimenting here… 🙂
But can’t you use a binary format then? Sorry, I’m not exactly a Maya expert.
Usually the hulls and meshes are compiled into a binary format in a second step while the higher level rigid body and joint parameters are compiled into some physics component. Are you suggesting to only export the high level information (e.g. mass properties, transforms, joint parameters) into a readable JSON format and export hulls and meshes into a binary format?
This is an interesting idea. But as a result a single export might contain many small files, which we could of course bundle somehow I guess. Still, how about the extra complexity dealing with many files and referencing them instead of keeping things together?
Having to deal with multiple files instead of a simple 1:1 relationship always increases complexity, that's right. Depending on how long the data gets, you could either store it in a simple array as shown in my post, or maybe add support for Base64-encoded data to your file format? Using Base64, you could easily load binary data from your JSON-like format and store it in an array internally. However, you lose editability and readability for that kind of data – I don't know how important it is to read and edit hulls & meshes in a text file (probably not so much).
I think I am missing something, but how does the translator work for objects with nested objects or nested arrays of objects? E.g. assume we have a transform component which has scale, rotation and translation and also a bounding sphere which is itself another object with center and radius.
The DataObject just needs to be able to hold DataObjects as well. It should be straightforward to add that to the reader and the DataObject implementation. Did you encounter any specific problems?
Sure, I understand that the data object can have other nested data objects. I wonder how the translator works, though? I can see myself writing translators for each data object that then get nested. How would the data translator for the following file look:
Player =
{
Position = [ 1, 2, 3 ]
Orientation = [ 0, 0, 0, 1 ]
BoundingSphere =
{
Center = [ 0, 1, 0 ]
Radius = 0.5
}
}
I see two options for dealing with this:
1) Build yourself some kind of generic pointer-to-data-member that is able to store pointers-to-members of arbitrary types in arbitrary classes. It’s possible using type erasure and templated implementation classes that derive from a common base class, but can get nasty fast (similar to how you would build a generic delegate, for example).
2) Identify nested members by giving them corresponding names in the translator, e.g. “Position”, “Orientation”, and “BoundingSphere/Center”, “BoundingSphere/Radius”. This would be much easier, and the translator only needs to know about ints, floats, bools, etc.
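Purely as an illustration of option 2), and with a completely made-up struct layout, the registration could then look something like this:

#include <vector>

struct PlayerData
{
    std::vector<float> position;
    std::vector<float> orientation;
    std::vector<float> boundingSphereCenter;
    float boundingSphereRadius;
};

// nested values are identified by path-like names, so the translator itself
// still only needs to know about floats, float arrays, etc.
const DataTranslator<PlayerData> g_playerTranslator = DataTranslator<PlayerData>()
    .Add("Position", &PlayerData::position)
    .Add("Orientation", &PlayerData::orientation)
    .Add("BoundingSphere/Center", &PlayerData::boundingSphereCenter)
    .Add("BoundingSphere/Radius", &PlayerData::boundingSphereRadius);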
How do you handle allocations in the array or object in your model? Do you just reallocate the array as it grows (you might waste a lot of memory when using linear allocator) or store the individual values as intrusive linked list (won’t always be continuous)?
How do you handle strings in your parser? Do you point to the original buffer (and modify it in place) or read the string line-by-line and just allocate the memory and copy them?
The array doesn't need to grow. You know exactly how much you need to allocate when parsing the file, it's just a matter of counting the number of elements (or braces/brackets, depending on what you're looking for), which is pretty fast because the whole file should be in memory anyway. Alternatively, you can use a linear allocator for allocating objects, and a simple std::vector for arrays. I'm not too worried about losing a few bytes here and there in tools-only code.
I allocate the memory and copy them. By the time you use the resulting structs the file will already be closed and its contents gone.
You wrote in your article that you parse the file exactly once – which contradicts the idea of counting objects/elements. Besides, it's not always so easy/fast to count elements – they may contain other objects, which in turn may contain arrays, etc. Parsing such a nested structure all over again can become slow.
Anyway, thanks for your ideas.
Even if objects contain other elements and/or arrays (or any other kind of nested structure), you're not interested in that when counting elements. For example, if you want to know how many objects there are at the root, just count the number of "{" – "}" pairs at the top level: for each "{" you increase a counter, and for each "}" you decrease it – this keeps track of the level you're at. Each "}" that brings the counter back down to zero closes an object at the root level, and therefore increases the number of root-level objects found.
Counting the number of objects or array elements this way takes a few milliseconds, at most (for the whole file).
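A rough sketch of such a counting pass over the in-memory file could look like this (quotes and comments are ignored here for brevity):

#include <cstddef>    // size_t

// counts root-level objects by matching "{" / "}" pairs
unsigned int CountRootObjects(const char* buffer, size_t length)
{
    unsigned int objectCount = 0;
    int level = 0;

    for (size_t i = 0; i < length; ++i)
    {
        if (buffer[i] == '{')
        {
            ++level;
        }
        else if (buffer[i] == '}')
        {
            --level;
            if (level == 0)        // a top-level object was just closed
                ++objectCount;
        }
    }

    return objectCount;
}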