Hashed strings

One scenario that is quite common in all game engines is having to look-up some resource (texture, shader, material, script, etc.) based on a string, which could look like the following:

Texture* tex = TextureManager::FindTexture("my_texture");

Whenever we have to find a certain resource based on a name, we could do one of the following:

  1. Store each resource’s name, and compare it with the name of the desired resource using strcmp() (or equivalent).
  2. Hash each resource’s name, and compare it with the hash of the desired resource’s name.

Option 1 is clearly going to be a lot slower, so everybody should be using Option 2 these days. But even using hashes often does a lot of redundant work at run-time, especially when dealing with constant strings like the above “my_texture”. For example, TextureManager::FindTexture() could be implemented as follows:

Texture* TextureManager::FindTexture(const char* name)
{
  size_t hash = CalculateHash(name);
  // lookup the texture in some container using 'hash'
}

So even for constant strings, CalculateHash() will still be called, and the hash has to be generated each and every time this function is called with a constant string. This is something we would like to get rid of, and here’s how.

Instead of taking a ‘const char*’ as argument, we could define a simple class holding a hashed string which will be passed to the function, like so:

class HashedString
{
public:
  // implementation here
private:
  size_t m_hash;
};

Texture* TextureManager::FindTexture(HashedString hash)
{
  // no more hash calculations in here!
}

This way, our hashed string class can tackle the problem of distinguishing between constant and non-constant strings, and we can be sure that no hash calculations take place inside the function. But still, how do we make sure that constant strings are not re-hashed every time?

First, we need to come up with a mechanism of distinguishing between constant and non-constant strings. Luckily, this is something we discussed last time. Using what we ended up with, our initial HashedString interface could look something like this:

class HashedString
{
public:
  struct ConstCharWrapper
  {
    inline ConstCharWrapper(const char* str);
    const char* str;
  };

  template <size_t N>
  inline explicit HashedString(const char (&str)[N]);

  explicit HashedString(ConstCharWrapper str);

private:
  size_t m_hash;
};

Hashing strings is done using a simple FNV-1a hash with the following implementation:

namespace
{
  static size_t CalculateFNV(const char* str, size_t offset, size_t prime)
  {
    unsigned int hash = offset;
    while (*str != 0)
    {
      hash ^= *str++;
      hash *= prime;
    }

    return hash;
  }
}

HashedString::HashedString(ConstCharWrapper str)
  : m_hash(CalculateFNV(str.str, OFFSET, PRIME))
{
}

The above implementation takes care of hashing non-constant strings, so all that is left is hashing constant strings. Of course, we could just call CalculateFNV() for constant strings as well, but this doesn’t get rid of unnecessary re-hashing. Instead, we’ll provide template specializations for strings of different length, which tremendously helps the compiler/optimizer at folding constant expressions into the final hash.

Looking at the above FNV-1a hash implementation, let’s try to figure out the resulting hash values for known strings:

  • “” = offset
  • “a” = (offset^ ‘a’)*prime
  • “ab” = (((offset^ ‘a’)*prime)^ ‘b’)*prime
  • “abc” = (((((offset^ ‘a’)*prime)^ ‘b’)*prime)^ ‘c’)*prime)

Offset and prime are compile-time constants (I use 2166136261u and 16777619u, respectively), so the above expressions really are constant. All we need to do is give the compiler some help, which can be achieved by providing template specializations for different N’s:

template <>
inline HashedString::HashedString(const char (&str)[1])
  : m_hash(OFFSET)
{
}

template <>
inline HashedString::HashedString(const char (&str)[2])
  : m_hash((OFFSET^str[0])*PRIME)
{
}

// etc.

By providing inline template specializations, the optimizer is able to fold the constant expressions in the constructor initialization list into the resulting hash value during compile-time!

As you may already have guessed by spotting the ‘OFFSET^constant*PRIME’ pattern in the above examples, the specializations can be nicely put into macros:

#define ME_HASHED_STRING_1            OFFSET
#define ME_HASHED_STRING_2            ((ME_HASHED_STRING_1 ^ str[0]) * PRIME)
#define ME_HASHED_STRING_3            ((ME_HASHED_STRING_2 ^ str[1]) * PRIME)
#define ME_HASHED_STRING_4            ((ME_HASHED_STRING_3 ^ str[2]) * PRIME)
#define ME_HASHED_STRING_5            ((ME_HASHED_STRING_4 ^ str[3]) * PRIME)
// etc.

#define ME_HASHED_STRING_SPECIALIZATION(n)                                    \
  template <>                                                                \
  inline HashedString::HashedString(const char (&str)[n])                    \
    : m_hash(ME_JOIN(ME_HASHED_STRING_, n))                                \
  {                                                                            \
    ME_UNUSED(str);                                                            \
  }

ME_HASHED_STRING_SPECIALIZATION(1)
ME_HASHED_STRING_SPECIALIZATION(2)
ME_HASHED_STRING_SPECIALIZATION(3)
ME_HASHED_STRING_SPECIALIZATION(4)
ME_HASHED_STRING_SPECIALIZATION(5)
// etc.

Finally, let’s take a look at the generated assembly code to check whether our template trickery really worked:

void Test(HashedString hash)
{
  ME_LOG("Hash is: %d", hash);
}

// calling code for constant string
334:     HashedString hash1("Hello");
335:     Test(hash1);
00263D79 B8 4B 31 5C F5       mov         eax,0F55C314Bh
00263D7E 50                   push        eax
00263D7F 68 A4 35 27 00       push        offset string "Hash is: %d" (2735A4h)
...
00263DAD E8 4E EF FF FF       call        me::core::logDispatch::Log (262D00h)

// calling code for non-constant string
337:     std::string str("Hello");
338:     me::core::HashedString hash2(str.c_str());
339:     Test(hash2);
...
00263E07 8D 4D E8             lea         ecx,[hash2]
00263E0A E8 B1 09 00 00       call        HashedString::HashedString (2647C0h)
00263E0F 8B 55 E8             mov         edx,dword ptr [hash2]
00263E12 52                   push        edx
00263E13 68 A4 35 27 00       push        offset string "Hash is: %d" (2735A4h)
...
00263E3A E8 C1 EE FF FF       call        me::core::logDispatch::Log (262D00h)

As you can see, the compiler inserted the resulting hash value for “Hello” directly into the code:

mov         eax,0F55C314Bh

Voila! Mission accomplished, this works in MSVC as well as on console compilers. See the BitSquid blog for a different approach to hashing constant strings.

9 thoughts on “Hashed strings

  1. I am not sure I understand how you can have constructors for every length of string. Do you define the Macro a bunch(255) times? Is there some sort of Macro you use to create all the macros you may need for any variable string length? I think that is the only thing that is unclear to me about this method you are using. Also, just wondering if this could be implemented using CRC hashing? Currently, I use a python program in my pre-build step to find all the CRC macros and fill in the hashed values before compiling. This solution seems a little better though.

      • After working it out last night I was able to implement a version of the Compile time string hashing with templates. I think there may be an error though. When I look at the diassembly it doesnt show the Hash Value, it pushes the string onto the stack and calls the contructor and all. As if its doing it at runtime. I wonder if it has something to do with my CRC Lookup Table. Its a constant, but since the compiler doesnt know where it is in memory at compile time, I wonder if it can’t resolve the Hash. I notice the solution on Altdev uses constant values in its hash.

        If you have a free moment, any hints in the direction I might be going wrong would be incredibly helpful.

        Here is my implementation:

        Here is the disassembly:

  2. Ideas/comments from the top of my head:
    -) Try shorter strings (e.g. “abc”) and see whether that makes a difference. Maybe the optimizer can’t fold the expression into a constant for longer strings?

    -) Optimization level needs to be set to the highest. Furthermore, you need to force the compiler to inline the function – what does your RE_INLINE macro expand to? Which compiler are you using?

    -) The CRC lookup-table may be a problem, as you stated. I think I have an idea on how to “tell” the compiler that the table’s content is constant…

    I’ll have a closer look at it tomorrow.

    • I was trying a few different strings after I posted. One of them being just “a”.

      I didnt realize the project settings for optimization were important, although I probably should have. Haha. My RE_INLINE macro is defined as __forceinline.

      I thought the CRC table might be a problem. I am going to look into converting them into macros. Maybe as constants that would help things out a bit. That plus the optimization setting may be helpful. I am using MSVC 2010, btw.

      Thanks for your help. It is much appreciated.

    • After Looking over some of the code and thinking about it. I was able to make some successful changes. First off, you were right about my optimization settings. Once I had those set my constant crcLookUp table became a non issue. After that it required me to do some moving around of things changing some of the input template parameters to suit the CRC method. After that it was pretty easy.

      I will admit a few times I probably sent the compiler into an endless loop with some of the template parameter values I put in. Hahah. Fun stuff.

      One thing I noticed about this method versus a pre build step is that the pre build step will substitute the hext value in and the compiler will see it as a constant. This allows you to use Hashes in your switch statements( case _CRC(“XYZ”, 0xDEADBEEF): ). When I use the template method, putting the Hash in as a case value is a compile error (i.e case CRCHash(“XYZ”).GetHash(): ). Even though the compiler will eventually make it a constant, it is not seen as one when checking the switch cases for validity. Interesting. Not sure if there is a work around for that.

      • Glad it worked for you!

        You cannot use values generated with this code in a switch-statement because the result is not a true compile-time expression, it’s rather something that will be folded by the compiler/optimizer into a single constant. What the code essentially does is making it easier for the compiler to see optimization opportunities – that’s why optimization settings are so important.

        That being said, there’s no way to turn the above into a compile-time expression except using the C++11 keyword constexpr.

  3. Pingback: Translating a human-readable JSON-like data format into C++ structs | Molecular Musings

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.