A multitude of ray-tracers to choose from

I’ve recently had the need for a simple ray-tracer since I started working on a precomputed radiance transfer (PRT) baking tool. There’s multitudes of sample code, implementations and SDKs available, so I wanted to share my findings after a bit of research.

In order to precompute PRT lighting data, all I needed was a simple ray-tracer which allows me to test whether a ray hits any triangle in a scene, with the scene consisting of triangles only. No fancy path-tracing, no photon-mapping, no full-blown ray-tracing engine. Just a simple ray-scene hit-test.
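To make the requirement concrete, here is a minimal sketch of such a hit-test in a PRT/AO-style bake. All names are made up for illustration, and the trivial "infinite plane" occluder is just a stand-in so the sketch runs – a real implementation would traverse a BVH over the scene's triangles:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Stand-in occluder so the sketch runs: an infinite plane below the scene
// blocks every ray pointing downwards. A real version traverses a BVH.
static bool isOccluded(const Vec3& /*org*/, const Vec3& dir)
{
    return dir.y < 0.0f;
}

// Crude uniform hemisphere sampling around normal n: rejection-sample the
// unit ball, normalize, and flip into the hemisphere around n.
static Vec3 sampleHemisphere(const Vec3& n)
{
    Vec3 d;
    float len2;
    do
    {
        d.x = 2.0f * std::rand() / RAND_MAX - 1.0f;
        d.y = 2.0f * std::rand() / RAND_MAX - 1.0f;
        d.z = 2.0f * std::rand() / RAND_MAX - 1.0f;
        len2 = dot(d, d);
    } while (len2 < 1e-6f || len2 > 1.0f);
    const float inv = 1.0f / std::sqrt(len2);
    d.x *= inv; d.y *= inv; d.z *= inv;
    if (dot(d, n) < 0.0f) { d.x = -d.x; d.y = -d.y; d.z = -d.z; }
    return d;
}

// Core baking loop per sample point: cast N hemisphere rays, count the
// unoccluded ones. Everything the ray-tracer needs to answer is the
// boolean isOccluded() query above.
static float bakeVisibility(const Vec3& pos, const Vec3& normal, int numSamples)
{
    int unoccluded = 0;
    for (int i = 0; i < numSamples; ++i)
        if (!isOccluded(pos, sampleHemisphere(normal)))
            ++unoccluded;
    return static_cast<float>(unoccluded) / numSamples;
}
```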

I spent a day or two having a look at different available options, both CPU- and GPU-based:

  • Build the ray-tracer from scratch: All you need is an efficient bounding volume hierarchy (BVH) and an efficient ray-triangle intersection test. I had code for a BVH lying around (built using the surface area heuristic), and a simple Möller-Trumbore intersection test was implemented in a matter of minutes. It worked, but was nowhere near as fast as I would have liked it to be.
  • Arauna: Arauna is a heavily optimized ray-tracer/path-tracer which is able to perform several million ray-casts per second on my PC, with source code available. Unfortunately, ripping parts of the implementation out of Arauna seems to be a rather heavy undertaking – basically all the code is sprinkled with SSE intrinsics, with little or no abstraction, and pure optimization goodness.
  • CUDA: GPU-based ray-tracers can outperform modern CPU ray-tracers if implemented correctly, hence I had a look at available CUDA ray-tracers. However, the amount of setup and extra steps needed in order to get something to run on the GPU is still quite annoying – iteration times are nowhere near as fast as working with C++ code only. I might look into a CUDA-based ray-tracer in the future (this one looks extremely promising, with source code available), but I’ve settled with a CPU-only solution for the time being (more on that later).
  • OptiX: NVIDIA’s ray-tracing engine built on top of CUDA is rather simple to use, and offers good interoperability with both OpenGL and Direct3D 11. In my experience, getting something to run with OptiX takes considerably less time than a pure CUDA-based solution, which was to my liking. However, OptiX suffers from the problem of “too much abstraction”, performing about four times slower than comparable CUDA solutions. If going with a GPU-based solution, I would rather choose CUDA instead.
  • Intel’s Embree: The last ray-tracer I looked at is the rather new Embree from Intel. I hadn’t heard of Embree before, which made me curious to take a closer look. The examples ran quite fast on my machine, and the source code is clean, concise, and documented. You can learn a trick or two about SSE/AVX optimization just by looking at the Vec2/Vec3/Vec4 implementations. This is the option I settled on in the end.
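For reference, the "from scratch" option above really does boil down to surprisingly little code. A plain scalar Möller-Trumbore test can be sketched like this (my own sketch, not taken from any of the libraries above):

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(const Vec3& a, const Vec3& b)   { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
static float dot(const Vec3& a, const Vec3& b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  cross(const Vec3& a, const Vec3& b)
{
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}

// Möller-Trumbore ray-triangle intersection. Returns true on a hit in
// front of the origin and writes the ray parameter t. Solves for (t,u,v)
// via scalar triple products instead of precomputing the triangle plane.
static bool intersectMT(const Vec3& org, const Vec3& dir,
                        const Vec3& v0, const Vec3& v1, const Vec3& v2,
                        float& t)
{
    const Vec3 e1 = sub(v1, v0);
    const Vec3 e2 = sub(v2, v0);
    const Vec3 p  = cross(dir, e2);
    const float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f)           // ray parallel to triangle plane
        return false;
    const float invDet = 1.0f / det;

    const Vec3 s = sub(org, v0);
    const float u = dot(s, p) * invDet;   // first barycentric coordinate
    if (u < 0.0f || u > 1.0f)
        return false;

    const Vec3 q = cross(s, e1);
    const float v = dot(dir, q) * invDet; // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f)
        return false;

    t = dot(e2, q) * invDet;              // distance along the ray
    return t > 0.0f;
}
```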

Why did I choose Embree?

  • Embree makes use of the whole SSE2/3/4 instruction set, with data structures built specifically for making good use of the vectorized instruction sets. As an example, Embree uses a bounding volume hierarchy which stores 4*N triangles per leaf, with each set of four triangles stored in a separate data structure which allows intersecting a single ray with four triangles simultaneously.
  • Embree does not use ray-packets for tracing coherent rays simultaneously, but rather has excellent data structures like the one mentioned above for tracing single, incoherent rays, which was exactly what I needed for Monte Carlo integration.
  • Integrating Embree into Molecule was painless and simple. The SSE code, BVHs, and ray-BVH intersection code are nicely abstracted and separated, making it easy to take just those parts of the implementation you actually need.
  • Embree can deal with triangle meshes out-of-the-box. Simply create a BVH using one of the available builders, and you’re done.
  • Embree is fast. I slightly changed some parts of the implementation which I didn’t need (e.g. base interfaces/virtual functions for different intersectors), and utilized all available hardware threads on my PC (quad-core i7-2600K), resulting in a raw performance of 25 million ray-casts per second in a moderate game scene (~400k triangles).
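To illustrate the 4-triangles-per-leaf idea, here is a sketch of testing a single ray against four triangles at once with SSE intrinsics, running Möller-Trumbore on struct-of-arrays data so every `_mm_*` operation processes all four triangles in one go. This is my own simplified illustration, not Embree's actual code – the real thing additionally handles closest-hit selection, traversal, and more:

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cmath>

// Four triangles in struct-of-arrays form: lane i of each __m128 holds
// the data for triangle i. e1/e2 are the two edges from v0, precomputed.
struct Triangle4
{
    __m128 v0x, v0y, v0z;
    __m128 e1x, e1y, e1z;
    __m128 e2x, e2y, e2z;
};

// Intersect one ray with four triangles at once (Möller-Trumbore, 4-wide).
// Returns a 4-bit mask of hitting lanes; *t4 receives the four distances
// (only meaningful in hitting lanes).
static int intersect4(float ox, float oy, float oz,
                      float dx, float dy, float dz,
                      const Triangle4& tri, __m128* t4)
{
    const __m128 Dx = _mm_set1_ps(dx), Dy = _mm_set1_ps(dy), Dz = _mm_set1_ps(dz);

    // p = cross(dir, e2)
    const __m128 px = _mm_sub_ps(_mm_mul_ps(Dy, tri.e2z), _mm_mul_ps(Dz, tri.e2y));
    const __m128 py = _mm_sub_ps(_mm_mul_ps(Dz, tri.e2x), _mm_mul_ps(Dx, tri.e2z));
    const __m128 pz = _mm_sub_ps(_mm_mul_ps(Dx, tri.e2y), _mm_mul_ps(Dy, tri.e2x));

    // det = dot(e1, p)
    const __m128 det = _mm_add_ps(_mm_mul_ps(tri.e1x, px),
                       _mm_add_ps(_mm_mul_ps(tri.e1y, py), _mm_mul_ps(tri.e1z, pz)));
    const __m128 invDet = _mm_div_ps(_mm_set1_ps(1.0f), det);

    // s = org - v0
    const __m128 sx = _mm_sub_ps(_mm_set1_ps(ox), tri.v0x);
    const __m128 sy = _mm_sub_ps(_mm_set1_ps(oy), tri.v0y);
    const __m128 sz = _mm_sub_ps(_mm_set1_ps(oz), tri.v0z);

    // u = dot(s, p) * invDet
    const __m128 u = _mm_mul_ps(invDet, _mm_add_ps(_mm_mul_ps(sx, px),
                     _mm_add_ps(_mm_mul_ps(sy, py), _mm_mul_ps(sz, pz))));

    // q = cross(s, e1)
    const __m128 qx = _mm_sub_ps(_mm_mul_ps(sy, tri.e1z), _mm_mul_ps(sz, tri.e1y));
    const __m128 qy = _mm_sub_ps(_mm_mul_ps(sz, tri.e1x), _mm_mul_ps(sx, tri.e1z));
    const __m128 qz = _mm_sub_ps(_mm_mul_ps(sx, tri.e1y), _mm_mul_ps(sy, tri.e1x));

    // v = dot(dir, q) * invDet;  t = dot(e2, q) * invDet
    const __m128 v = _mm_mul_ps(invDet, _mm_add_ps(_mm_mul_ps(Dx, qx),
                     _mm_add_ps(_mm_mul_ps(Dy, qy), _mm_mul_ps(Dz, qz))));
    *t4 = _mm_mul_ps(invDet, _mm_add_ps(_mm_mul_ps(tri.e2x, qx),
          _mm_add_ps(_mm_mul_ps(tri.e2y, qy), _mm_mul_ps(tri.e2z, qz))));

    // valid = |det| > eps && u >= 0 && v >= 0 && u+v <= 1 && t > 0
    const __m128 absDet = _mm_andnot_ps(_mm_set1_ps(-0.0f), det);
    __m128 mask = _mm_cmpgt_ps(absDet, _mm_set1_ps(1e-8f));
    mask = _mm_and_ps(mask, _mm_cmpge_ps(u, _mm_setzero_ps()));
    mask = _mm_and_ps(mask, _mm_cmpge_ps(v, _mm_setzero_ps()));
    mask = _mm_and_ps(mask, _mm_cmple_ps(_mm_add_ps(u, v), _mm_set1_ps(1.0f)));
    mask = _mm_and_ps(mask, _mm_cmpgt_ps(*t4, _mm_setzero_ps()));
    return _mm_movemask_ps(mask);
}
```

The payoff is that the per-lane work is identical to the scalar test, but amortized over four triangles per instruction – which is exactly why the 4*N-triangles-per-leaf layout pays off for single, incoherent rays.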

What actually took me longest with integrating Embree into Molecule was that I would always get the following error message when building Molecule, but not when building Embree standalone:

ray.h(40): error C2208: 'float' : no members defined using this type

WTF? I cannot use a float because of… what?

The innocent-looking code that triggers this error is the following:

namespace embree
{
  /*! Ray structure. Contains all information about a ray including
   *  precomputed reciprocal direction. */
  struct Ray
  {
    /*! Default construction does nothing. */
    __forceinline Ray() {}

    /*! Constructs a ray from origin, direction, and ray segment. Near
     *  has to be smaller than far. */
    __forceinline Ray(const Vec3f& org, const Vec3f& dir, const float& near = zero, const float& far = inf)
      : org(org), dir(dir), rdir(1.0f/set_if_zero(dir,Vec3f(std::numeric_limits<float>::min()))), near(near), far(far) {}

    Vec3f  org;     //!< Ray origin
    Vec3f  dir;     //!< Ray direction
    Vec3f  rdir;    //!< Reciprocal ray direction
    float near;    //!< Start of ray segment
    float far;     //!< End of ray segment
  };

  /*! Outputs ray to stream. */
  inline std::ostream& operator<<(std::ostream& cout, const Ray& ray);
}

Don’t read any further if you want to solve the puzzle yourself.

There’s a simple reason why the code doesn’t compile – again, it’s one of those stupid global namespace-polluting #defines which get pulled in via the Windows.h header! If Microsoft ever changes only one thing in a future release of Visual Studio, it should be Windows.h, also known as the header from hell.

Without further ado, a simple

#ifdef far
#    undef far
#endif

#ifdef near
#    undef near
#endif

solves the problem.
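If you’d rather not remove the macros for the rest of the translation unit, push_macro/pop_macro (supported by MSVC, GCC, and Clang) saves and restores them around the offending code. A minimal, self-contained sketch – the #defines at the top merely simulate the Windows.h pollution (on Windows, `near` and `far` expand to nothing):

```cpp
#include <cassert>

// Simulate the Windows.h pollution for the sake of the example.
#define near
#define far

// Save the macro definitions, remove them locally, and restore them after
// the declarations that need 'near'/'far' as identifiers:
#pragma push_macro("near")
#pragma push_macro("far")
#undef near
#undef far

struct Ray
{
    float near;   // compiles fine now
    float far;
};

inline float raySpan(const Ray& r)
{
    return r.far - r.near;
}

#pragma pop_macro("near")
#pragma pop_macro("far")
```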

If you’ve successfully used other ray-tracers in the past, please let me know! I would very much like to hear about other solutions.

7 thoughts on “A multitude of ray-tracers to choose from”

    • Just had a quick look, could be useful! Would be nice to have a performance-comparison with Intel’s Embree. Performance was one of the reasons why I chose Embree – if you need to raytrace lots of data on a daily basis during development, a faster raytracer really does pay off.

  1. I’ve used both OptiX and Embree in the past. I got about 5x better performance out of OptiX compared with Embree, but then again there’s a fair amount of pain involved in shuffling data to the GPU and back. The thing is, with most Monte Carlo algorithms you want access to the intersection results on the CPU at some point. Also, the Kd-tree builder in OptiX can be painfully slow and uses a lot of memory (again, compared with Embree). Finally, OptiX is NVIDIA Fermi only, so that also limits the audience for your code. What I ended up doing was writing a wrapper that used OptiX when available and Embree as a hot-swappable fallback. Sometimes the GPU can run out of memory, and you don’t want your code falling over in that case. And note that OptiX doesn’t work over an Incredibuild render farm; OptiX contexts are not allowed on a remote node within an Incredibuild process.

    • Thanks for the information, good to know what works and what doesn’t. On which kind of NVIDIA hardware did you get a 5x speed-up? GeForce 5xx? 6xx? I haven’t looked into GPU ray-tracing yet, but will definitely do so in the future.

      • This was 580 hardware; I didn’t try the newer 680 cards. But then the 5x is very back-of-the-envelope. It really varies a lot depending on the scene and the ray distribution you use. I expect that if you port more of the transport code to CUDA, avoiding shuffling each intersection result back from the GPU, you could get even better performance. Think computing form factors or AO and just shuffling back the final values.

    • Not significantly, but by a bit.

      Mostly I kicked out stuff I didn’t need for certain calculations (like triangle IDs), and made it work with my task scheduler. I still haven’t gotten around to making it use AVX and work with 8 triangles at once – that should provide a nice speed-up on the CPU side.
