Archive for October, 2015

Compile-time RNG tricks

Thursday, October 15th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

Compile-time random number generation is quite useful. For instance, we could generate GUIDs (version 4 UUIDs):

namespace cx
{
  struct guid_t
  {
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint64_t data4;
  };
 
  template <uint64_t S>
  constexpr guid_t guidgen()
  {
    return guid_t {
        cx::pcg32<S>(),
        cx::pcg32<S>() >> 16,
        0x4000 | cx::pcg32<S>() >> 20,
        (uint64_t{8 + (cx::pcg32<S>() >> 30)} << 60)
        | uint64_t{cx::pcg32<S>() & 0x0fffffff} << 32
        | uint64_t{cx::pcg32<S>()} };
  }
}
 
#define cx_guid \
  cx::guidgen<cx::fnv1(__FILE__ __DATE__ __TIME__) + __LINE__>

There are situations right now where one might have scripts (probably Python or some such language) in one’s build chain that do this type of generation of C++ code, for example to introduce randomness to each build in the name of security, or simply to mark the build uniquely. It would be nice not to have to maintain extra non-C++ code and extra steps in the build chain.

Is PCG32 secure? Maybe secure enough…

Security you say? We have a source of random numbers that changes every build… we could transparently “encrypt” string literals and decode them at point of use. Might be useful to someone. To wrangle with a string in a constexpr manner, I need to treat it as an array, and then encrypt (I’ll just xor) each character with my random byte stream. I can figure out how long a string literal is at compile time, store the random seed as a template parameter, and use an index_sequence expansion to do the encryption.

template <uint64_t S>
constexpr char encrypt_at(const char* s, size_t idx)
{
  return s[idx] ^
    static_cast<char>(pcg32_output(
      pcg32_advance(S, idx+1)) >> 24);
}
 
template <size_t N>
struct char_array
{
  char data[N];
};
 
template <uint64_t S, size_t ...Is>
constexpr char_array<sizeof...(Is)> encrypt(
    const char *s, std::index_sequence<Is...>)
{
  return {{ encrypt_at<S>(s, Is)... }};
}

That takes care of constructing an encrypted array from a string literal. The rest is just wrapping it up in an encrypted_string class and providing a sane runtime decryption function (we could use the existing C++11 constexpr functions, but they are strangely formulated for runtime use – maybe a more natural formulation would be easier to optimize). And we can give the class a conversion to string.

inline std::string decrypt(uint64_t S, const char* s, size_t n)
{
  std::string ret;
  ret.reserve(n);
  for (size_t i = 0; i < n; ++i)
  {
    S = pcg32_advance(S);
    ret.push_back(s[i] ^
      static_cast<char>(pcg32_output(S) >> 24));
  }
  return ret;
}
 
template <uint64_t S, size_t N>
class encrypted_string
{
public:
  constexpr encrypted_string(const char(&a)[N])
    : m_enc(encrypt<S>(a, std::make_index_sequence<N-1>()))
  {}
 
  constexpr size_t size() const { return N-1; }
 
  operator std::string() const
  {
    return decrypt(S, m_enc.data, N-1);
  }
 
private:
  const char_array<N-1> m_enc;
};
 
template <uint64_t S, size_t N>
constexpr encrypted_string<S, N> make_encrypted_string(
    const char(&s)[N])
{
  return encrypted_string<S, N>(s);
}
 
#define CX_ENCSTR_RNGSEED \
  uint64_t{cx::fnv1(__FILE__ __DATE__ __TIME__) + __LINE__}
 
#define cx_make_encrypted_string \
  cx::make_encrypted_string<CX_ENCSTR_RNGSEED>

The more observant and pedantic among you may have noticed that strictly, I’ve strayed into C++14 territory here, using std::index_sequence. But I think I’m still in the spirit of C++11 constexpr. And more importantly, VS2015 supports std::integer_sequence.

Anyway, let’s exercise this code and check it does the right thing:

int main(int, char* [])
{
  constexpr auto enc =
    cx_make_encrypted_string("I accidentally the string");
  cout << string(enc) << endl;
  return 0;
}

Here’s the result:

$ ./test_constexpr
I accidentally the string
$

And the plaintext string doesn’t appear in the binary:

$ objdump -s -j .rodata ./test_constexpr
 
./test_constexpr:     file format elf64-x86-64
 
Contents of section .rodata:
 404700 01000200 d31e337d 58244d8a 48e52178  ......3}X$M.H.!x
 404710 139d483c 5113d3f1 0aec79a3 80626173  ..H<Q.....y..bas
 404720 69635f73 7472696e 6700               ic_string.

Not bad. Per-compile obfuscation on string literals, with zero memory overhead and only a modest cost at the point of use.

Compile-time counters, revisited

Wednesday, October 14th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

Some time ago I read a blog post showing how to make a compile-time counter: a constexpr function that would return monotonically increasing integers. When I first read it I didn’t really take the time to understand it fully, but now that I was on a compile-time computation kick, I decided to grok it fully.

Without getting too far into the nitty-gritty (go read the other blog post if you’re interested), the technique relies on using template instantiations to “bring-to-life” functions that affect future template instantiations. Thus we have a flag that declares (but does not yet define) a friend function:

template <int N>
struct flag
{
  friend constexpr int adl_flag (flag<N>);
};

And then a recursive reader function template that uses ADL and SFINAE to match against the (as-yet-undefined) friend function(s), bottoming out at zero:

template <int N, int = adl_flag(flag<N>{})>
constexpr int reader(int, flag<N>)
{
  return N;
}
 
template <int N>
constexpr int reader(
    float, flag<N>, int R = reader(0, flag<N-1>{}))
{
  return R;
}
 
constexpr int reader(float, flag<0>)
{
  return 0;
}

And finally, a writer that, when instantiated with the result of calling reader, instantiates the friend function, making the next call to reader terminate at one level higher:

template <int N>
struct writer
{
  friend constexpr int adl_flag (flag<N>)
  {
    return N;
  }
  static constexpr int value = N;
};

This is a curiosity, right? A foible of C++, a fairy tale told by wizened programmers to fresh graduates to simultaneously impress and revolt them, no? Could anything useful be done with this? Well, C++ is full of such tales, and it’s a short hop from can’t-look-away-revolting to established feature – after all, template metaprogramming was discovered pretty much by accident…

In fact, while at CppCon, I met up with Ansel and Barbara from CopperSpice, who are using a very similar technique to do away with the Qt Metaobject Compiler.

Max recursion depth, we meet again

My first thought was that this technique suffers at the hands of my old enemy, maximum recursion depth. In this case, maximum template instantiation depth, which despite a standard-recommended 1024, is frequently just 256 – lower than the recommended constexpr recursion depth of 512. So let’s do something about that.

Well, one quick-and-dirty way to do this is to compute the count in two halves: lower bits and upper bits, and then stick them together. When we reach the max on the lower bits, we’ll roll over one of the upper bits. So we have two flags representing the high and low, with the low flag also parameterized on the high bits:

template <int H, int L>
struct flag1
{
  friend constexpr int adl_flag1(flag1<H, L>);
};
template <int H>
struct flag2
{
  friend constexpr int adl_flag2(flag2<H>);
};

And two readers: the low bits reader is in a struct to avoid partial specialization of a function, because it’s effectively parameterized on the high bits as well as the low bits.

template <int H>
struct r1
{
  template <int L, int = adl_flag1(flag1<H, L>{})>
  static constexpr int reader(int, flag1<H, L>)
  {
    return L;
  }
  template <int L>
  static constexpr int reader(
      float, flag1<H, L>, int R = reader(0, flag1<H, L-1>{}))
  {
    return R;
  }
  static constexpr int reader(float, flag1<H, 0>)
  {
    return 0;
  }
};
 
template <int H, int = adl_flag2(flag2<H>{})>
constexpr int reader(int, flag2<H>)
{
  return H;
}
template <int H>
constexpr int reader(
    float, flag2<H>, int R = reader(0, flag2<H-1>{}))
{
  return R;
}
constexpr int reader(float, flag2<0>)
{
  return 0;
}

The low bits writer looks much the same as before, and the high bits writer is specialized on a bool indicating whether or not to instantiate the friend function, which we only do when the low bits roll over:

template <int H, bool B>
struct writehi
{
  friend constexpr int adl_flag2(flag2<H>)
  {
    return H;
  }
  static constexpr int value = H;
};
 
template <int H>
struct writehi<H, false>
{
  static constexpr int value = H;
};

The writer can then write both the high and low bits accordingly:

template <int H, int L>
struct writer
{
  static constexpr int hi_value =
    writehi<H+1, L == MAX>::value;
  static constexpr int lo_value =
    writelo<H, (L & BIT_MASK)>::value;
  static constexpr int value = (H << BIT_DEPTH) + L;
};

Using this approach we can easily increase the maximum number that we can get out of our counter from 256 to 16k or so, which is enough for one translation unit, for me.

Random acts of compiler abuse

Now, a counter is OK, as far as that goes, but something more useful might be nice… how about random numbers? But surely only a madman would try to implement a Mersenne Twister in C++11 constexpr-land. (This despite the fact that I did SHA256 string hashing.) No, these days when I think of random numbers, I think of Melissa O’Neill and her excellent PCG32. If you haven’t seen her video, go watch it. I’ll wait here. PCG32 has some distinct advantages for constexpr implementation:

  • It’s easy to implement
  • It’s fast
  • It’s not a lot of code (seriously, ~10 lines)
  • It’s easy to implement
  • It’s understandable
  • It’s easy to implement

Here’s a simple implementation of the whole 32-bit affair:

constexpr uint64_t pcg32_advance(uint64_t s)
{
  return s * 6364136223846793005ULL
    + (1442695040888963407ULL | 1);
}
 
constexpr uint64_t pcg32_advance(uint64_t s, int n)
{
  return n == 0 ? s : pcg32_advance(pcg32_advance(s), n-1);
}
 
constexpr uint32_t pcg32_xorshift(uint64_t s)
{
  return ((s >> 18u) ^ s) >> 27u;
}
 
constexpr uint32_t pcg32_rot(uint64_t s)
{
  return s >> 59u;
}
 
constexpr uint32_t pcg32_output(uint64_t s)
{
  return (pcg32_xorshift(s) >> pcg32_rot(s))
    | (pcg32_xorshift(s) << ((-pcg32_rot(s)) & 31));
}

And now we can use exactly the same pattern as the constexpr counter, except the writer, instead of giving us an integer, will give us a random number (and we also plumb through the random seed S as a template parameter):

template <uint64_t S, int H, int L>
struct writer
{
  static constexpr int hi_value =
    writehi<S, H+1, L == MAX>::value;
  static constexpr int lo_value =
    writelo<S, H, (L & BIT_MASK)>::value;
  static constexpr uint32_t value =
    pcg32_output(pcg32_advance(S, (H << BIT_DEPTH) + L));
};

There is probably a way to improve this by storing the actual PCG-computed value in a template instantiation, rather than simply using the integer to pump the PCG every time as I am doing. But it’s a proof-of-concept and works well enough for now. A simple macro will give us a suitable seed for our compile-time RNG by hashing a few things together:

#define cx_pcg32 \
  cx::pcg32<cx::fnv1(__FILE__ __DATE__ __TIME__) + __LINE__>

And now we have a compile-time RNG that is seeded differently every compile, every file, every line. There are potentially a lot of template instantiations – maybe using a lot of memory in the compiler – but we can do some useful things with this.

More string hashing with C++11 constexpr

Wednesday, October 14th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

So FNV1 was easy, and Murmur3 wasn’t too much harder; for a challenge and to see how far I could go, I decided to try to compute an MD5 string hash using C++11 constexpr.

This was significantly harder. I broke out my copy of Applied Cryptography 2e, found a reference implementation of MD5 in C, read through RFC 1321 and the pseudocode on the Wikipedia page.

Few, few the bird make her nest

I built up MD5 piece by piece, pulling out parts of the reference implementation to check that I’d got each building block right before moving on. The actual round function primitives were the easy part. As usual for a hash function, they are a mixture of bitwise functions, shifts, rotates, adds. These types of things make for trivial constexpr functions, for example:

constexpr uint32_t F(uint32_t X, uint32_t Y, uint32_t Z)
{
  return (X & Y) | (~X & Z);
}
 
constexpr uint32_t rotateL(uint32_t x, int n)
{
  return (x << n) | (x >> (32-n));
}
 
constexpr uint32_t FF(uint32_t a, uint32_t b,
                      uint32_t c, uint32_t d,
                      uint32_t x, int s, uint32_t ac)
{
  return rotateL(a + F(b,c,d) + x + ac, s) + b;
}

There are similar functions for the other low-level primitives of the MD5 round functions, conventionally called F, G, H and I.

Now, MD5 works on buffer chunks of 512 bits, or 16 32-bit words. So assuming we have a string long enough, it’s easy, if a bit long-winded, to convert a block of string into a schedule that MD5 can work on:

struct schedule
{
  uint32_t w[16];
};
 
constexpr schedule init(const char* buf)
{
  return { { word32le(buf), word32le(buf+4), word32le(buf+8), word32le(buf+12),
        word32le(buf+16), word32le(buf+20), word32le(buf+24), word32le(buf+28),
        word32le(buf+32), word32le(buf+36), word32le(buf+40), word32le(buf+44),
        word32le(buf+48), word32le(buf+52), word32le(buf+56), word32le(buf+60) } };
}

Seconds out, round one

It’s not pretty, but such are the constructs that may arise when you only have C++11 constexpr to work with. The output of MD5 is going to be 4 32-bit words of hash (denoted A, B, C and D in the literature), and in the main loop, which happens for each 512 bits of the message, there are four rounds, each round having 16 steps. After each step, the 4 words are rotated so that A becomes the new B, B becomes the new C, etc. So a round step is fairly easy – here’s the round 1 step which uses the F primitive:

struct md5sum
{
  uint32_t h[4];
};
 
constexpr md5sum round1step(const md5sum& sum,
                            const uint32_t* block,
                            int step)
{
  return { {
      FF(sum.h[0], sum.h[1], sum.h[2], sum.h[3],
         block[step], r1shift[step&3], r1const[step]),
        sum.h[1], sum.h[2], sum.h[3]
        } };
}

As you can see, there are some constants (r1shift and r1const) for each stage of round 1. Rotating the words after each round step is also easy:

constexpr md5sum rotateCR(const md5sum& sum)
{
  return { { sum.h[3], sum.h[0], sum.h[1], sum.h[2] } };
}
 
constexpr md5sum rotateCL(const md5sum& sum)
{
  return { { sum.h[1], sum.h[2], sum.h[3], sum.h[0] } };
}

So now we are able to put together a complete round, which recurses, calling the round step function and rotating the output until we’re done after 16 steps.

constexpr md5sum round1(const md5sum& sum,
                        const uint32_t* msg, int n)
{
  return n == 16 ? sum :
    rotateCL(round1(rotateCR(round1step(sum, msg, n)), msg, n+1));
}

Rounds 2 through 4 are very similar, but instead of using the F primitive, they use G, H and I respectively. A complete MD5 transform for one 512-bit block looks like this (with a helper function that sums the MD5 result parts):

constexpr md5sum sumadd(const md5sum& s1, const md5sum& s2)
{
  return { { s1.h[0] + s2.h[0], s1.h[1] + s2.h[1],
        s1.h[2] + s2.h[2], s1.h[3] + s2.h[3] } };
}
 
constexpr md5sum md5transform(const md5sum& sum,
                              const schedule& s)
{
  return sumadd(sum,
                round4(
                    round3(
                        round2(
                            round1(sum, s.w, 0),
                            s.w, 0),
                        s.w, 0),
                    s.w, 0));
}

The ugly business of padding

So far so good. This works, as long as we’re processing the complete 512-bit blocks contained in the message. Now to consider how to finish off. The padding scheme for MD5 is as follows:

  • Append a 1-bit (this is always done, even if the message is a multiple of 512 bits)
  • Add as many 0-bits as you need to, to make up 448 bits (56 bytes)
  • Append a 64-bit value of the original length in bits to the 448 bits to make a final 512-bit block

This gets messy in C++11 constexpr-land, but suffice to say that I wrote a leftover function analogous to init that could deal with padding. Now, finally, the complete MD5 calculation, which has three conditions:

  1. As long as there is a 64-byte (512-bit) block to work on, recurse on that.
  2. If the leftover is 56 bytes or more, pad it without the length in there and recurse on an “empty block”.
  3. Otherwise, pad with padding and length, transform and we’re done!
constexpr md5sum md5update(const md5sum& sum, const char* msg,
                            int len, int origlen)
{
  return
    len >= 64 ?
    md5update(md5transform(sum, init(msg)), msg+64, len-64, origlen) :
    len >= 56 ?
    md5update(md5transform(sum,
                           leftover(msg, len, origlen, 64)),
              msg+len, -1, origlen) :
    md5transform(sum, leftover(msg, len, origlen, 56));
}

Woot!

I don’t mind admitting that when I finally got this working, I did a happy dance around my apartment. Even though the code has ugly parts. But of course the thrill of achievement soon gives way to the thirst for more, and MD5 isn’t exactly today’s choice for hash algorithms, even if it is still often used where cryptographic strength isn’t paramount. So I started thinking about SHA256.

Of course, cryptography proceeds largely by incrementally twiddling algorithms when they are found lacking: adding rounds, beefing up functions, etc. And so the NSA’s SHA series proceeded from Ron Rivest’s MD series. In particular, the block size and padding schemes are identical. That was a huge leg-up on SHA256, since I’d already solved the hardest part and I could reuse it.

SHA256 produces a larger digest, and obviously has different magic numbers that go into the rounds, but structurally it’s very similar to MD5. Again I used Wikipedia’s pseudocode as a reference. The only real thing that I needed to do over MD5, besides change up some trivial maths, was the schedule extend step. SHA256 copies the 16 words of the message to a 64-word block to work on, and extends the 16 words into the remaining 48 words. Computing a single value in the schedule, w[i], is easily written in a recursive style, where n represents the “real” values we already have, initially 16:

constexpr uint32_t extendvalue(const uint32_t* w, int i, int n)
{
  return i < n ? w[i] :
    extendvalue(w, i-16, n) + extendvalue(w, i-7, n)
    + s0(extendvalue(w, i-15, n)) + s1(extendvalue(w, i-2, n));
}

And to extend all the values, we can then do:

constexpr schedule sha256extend(const schedule& s)
{
  return { { s.w[0], s.w[1], s.w[2], s.w[3],
        s.w[4], s.w[5], s.w[6], s.w[7],
        s.w[8], s.w[9], s.w[10], s.w[11],
        s.w[12], s.w[13], s.w[14], s.w[15],
        extendvalue(s.w, 16, 16), extendvalue(s.w, 17, 16),
        //
        // and so on...
        //
        extendvalue(s.w, 62, 16), extendvalue(s.w, 63, 16) } };
}

When I did this initially, I ran into the compiler’s max step limit for a constexpr computation. Different from the max recursion limit, the max step limit is a guideline for how many expressions should be evaluated within a single constant expression. Appendix B suggests 220. If a compiler did memoization on extendvalue I suspect this would be fine, but evidently clang doesn’t, so to compromise, I split the function into 3: sha256extend16, sha256extend32 and sha256extend48, each of which extends by 16 words at a time. And that worked.

After extending the schedule and changing some of the maths functions, the rest was easy – practically the same as for MD5. Now I was done with string hashing.

For the next experiments with compile-time computation, I wanted to understand a strange thing I’d seen online…

C++11 compile-time string hashing

Tuesday, October 13th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

Now that I was used to writing C++11-style constexpr, I decided to try some compile-time string hashing. It turns out that FNV1 is very easy to express in a move-down-the-string recursive style:

namespace detail
{
  constexpr uint64_t fnv1(uint64_t h, const char* s)
  {
    return (*s == 0) ? h :
      fnv1((h * 1099511628211ull) ^
           static_cast<uint64_t>(*s), s+1);
  }
}
constexpr uint64_t fnv1(const char* s)
{
  return true ?
    detail::fnv1(14695981039346656037ull, s) :
    throw err::fnv1_runtime_error;
}

That wasn’t difficult enough

So far so good. How about some other hash functions? On a previous project I had used MurmurHash, so I decided to give that a go.

(Coincidentally, a blog post has recently surfaced implementing 32-bit Murmur3 using C++14 constexpr. But I was sticking to C++11 constexpr.)

Implementing such a big function from cold in C++11 constexpr was bound to be difficult, so I scaffolded. I used the freely-available runtime implementation for testing, and initially I converted it to C++14 constexpr, which was fairly easy. After doing that, and with Wikipedia as a pseudocode guide I was ready to start breaking down (or building up) the functionality, C++11-style. At the base of 32-bit Murmur3 is the hash round function which is done for each 4-byte chunk of the key:

constexpr uint32_t murmur3_32_k(uint32_t k)
{
  return (((k * 0xcc9e2d51) << 15)
          | ((k * 0xcc9e2d51) >> 17)) * 0x1b873593;
}
 
constexpr uint32_t murmur3_32_hashround(
    uint32_t k, uint32_t hash)
{
  return (((hash^k) << 13)
          | ((hash^k) >> 19)) * 5 + 0xe6546b64;
}

And this can easily be put into a loop, with a helper function to constitute 4-byte chunks (the somewhat strange formulation of word32le here is because it is used differently in some other code):

constexpr uint32_t word32le(const char* s, int len)
{
  return
    (len > 0 ? static_cast<uint32_t>(s[0]) : 0)
    + (len > 1 ? (static_cast<uint32_t>(s[1]) << 8) : 0)
    + (len > 2 ? (static_cast<uint32_t>(s[2]) << 16) : 0)
    + (len > 3 ? (static_cast<uint32_t>(s[3]) << 24) : 0);
}
 
constexpr uint32_t word32le(const char* s)
{
  return word32le(s, 4);
}
 
constexpr uint32_t murmur3_32_loop(
    const char* key, int len, uint32_t hash)
{
  return len == 0 ? hash :
    murmur3_32_loop(
        key + 4,
        len - 1,
        murmur3_32_hashround(
            murmur3_32_k(word32le(key)), hash));
}

So this is the first part of Murmur3 that will process all the 4-byte chunks of key. What remains is to deal with the trailing bytes (up to 3 of them) and then finalize the hash. To deal with the trailing bytes, I chain the end functions (_end0 through _end3) and “jump in” at the right place. There is probably a more elegant way to do this, but…

constexpr uint32_t murmur3_32_end0(uint32_t k)
{
  return (((k*0xcc9e2d51) << 15)
          | ((k*0xcc9e2d51) >> 17)) * 0x1b873593;
}
 
constexpr uint32_t murmur3_32_end1(uint32_t k,
                                   const char* key)
{
  return murmur3_32_end0(
      k ^ static_cast<uint32_t>(key[0]));
}
 
constexpr uint32_t murmur3_32_end2(uint32_t k,
                                   const char* key)
{
  return murmur3_32_end1(
      k ^ (static_cast<uint32_t>(key[1]) << 8), key);
}
 
constexpr uint32_t murmur3_32_end3(uint32_t k,
                                   const char* key)
{
  return murmur3_32_end2(
      k ^ (static_cast<uint32_t>(key[2]) << 16), key);
}
 
constexpr uint32_t murmur3_32_end(uint32_t hash,
                                  const char* key, int rem)
{
  return rem == 0 ? hash :
    hash ^ (rem == 3 ? murmur3_32_end3(0, key) :
            rem == 2 ? murmur3_32_end2(0, key) :
            murmur3_32_end1(0, key));
}

Finalizing the hash is a very similar affair, with 3 stages:

constexpr uint32_t murmur3_32_final1(uint32_t hash)
{
  return (hash ^ (hash >> 16)) * 0x85ebca6b;
}
constexpr uint32_t murmur3_32_final2(uint32_t hash)
{
  return (hash ^ (hash >> 13)) * 0xc2b2ae35;
}
constexpr uint32_t murmur3_32_final3(uint32_t hash)
{
  return (hash ^ (hash >> 16));
}
 
constexpr uint32_t murmur3_32_final(uint32_t hash, int len)
{
  return
    murmur3_32_final3(
        murmur3_32_final2(
            murmur3_32_final1(hash ^ static_cast<uint32_t>(len))));
}

And that’s all there is to it: all that remains is to stitch the pieces together and provide the driver function:

constexpr uint32_t murmur3_32_value(const char* key, int len,
                                    uint32_t seed)
{
  return murmur3_32_final(
      murmur3_32_end(
          murmur3_32_loop(key, len/4, seed),
          key+(len/4)*4, len&3),
      len);
}
 
constexpr uint32_t murmur3_32(const char *key, uint32_t seed)
{
  return true ?
    murmur3_32_value(key, strlen(key), seed) :
    throw err::murmur3_32_runtime_error;
}

Doing strlen non-naively

One more thing: strlen crept in there. This is of course a constexpr version of strlen. The naive way to do this is to step down the string linearly, as seen in the FNV1 algorithm. With a max recursive depth of 512, I thought this fairly limiting… it might not be uncommon to have 1k string literals that I want to hash?

So the way I implemented strlen was to measure the string by chunks, constraining a max recursion depth of say 256. Instead of a plain linear recursion, it’s basically recursing for every 256 bytes of string length, and then recursing on the chunk inside that. So the max depth is 256 + 256 = 512, and we can deal with strings close to 64k in size (depending on how deep we are already when calling strlen). There’s a chance that strlen might be made constexpr in the future – I think it’s already implemented as an intrinsic in some compilers. So maybe I can throw away that code someday.

Anyway, now we have a complete implementation of 32-bit Murmur3 in C++11 constexpr style, and the tests go something like this (and can be checked online):

static_assert(murmur3_32("hello, world", 0) == 345750399,
              "murmur3 test 1");
static_assert(murmur3_32("hello, world1", 0) == 3714214180,
              "murmur3 test 2");
static_assert(murmur3_32("hello, world12", 0) == 83041023,
              "murmur3 test 3");
static_assert(murmur3_32("hello, world123", 0) == 209220029,
              "murmur3 test 4");
static_assert(murmur3_32("hello, world1234", 0) == 4241062699,
              "murmur3 test 5");
static_assert(murmur3_32("hello, world", 1) == 1868346089,
              "murmur3 test 6");

OK. That was a lot of code for a blog post. But after the trivial FNV1 and the only slightly harder Murmur3, I wanted more of a challenge. What about MD5? Or SHA-256? That’s for the next post.

More constexpr floating-point computation

Tuesday, October 13th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

In the last post, I covered my first forays into writing C++11-style constexpr floating point math functions. Once I’d done some trig functions, exp, and floor and friends, it seemed like the rest would be a piece of cake. Not entirely…

But there was some easy stuff. With trunc sorted out, fmod was just:

constexpr float fmod(float x, float y)
{
  return y != 0 ? x - trunc(x/y)*y :
    throw err::fmod_domain_error;
}

And remainder was very similar. Also, fmax, fmin and fdim were trivial: I won’t bore you with the code.

Euler was quite good at maths

Now I come from a games background, and there are a few maths functions that get used all the time in games. I already have sqrt. What I want now is atan and atan2. To warm up, I implemented the infinite series calculations for asin and acos, each of which are only defined for an input |x| <= 1. That was fairly straightforward, broadly in line with what I'd done before for the series expansions of sin and cos.

Inverting tan was another matter: but a search revealed a way to do it using asin:

arctan(x) = arcsin(\frac{x}{\sqrt{x^2+1}})

Which I went with for a while. But then I spotted the magic words:

“Leonhard Euler found a more efficient series for the arctangent”

arctan(z) = \frac{z}{1+z^2} \sum_{n=0}^{\infty} \prod_{k=1}^n \frac{2kz^2}{(2k+1)(1+z^2)}

More efficient (which I take here to mean more quickly converging)? And discovered by Euler? That’s good enough for me.

namespace detail
{
  template <typename T>
  constexpr T atan_term(T x2, int k)
  {
    return (T{2}*static_cast<T>(k)*x2)
      / ((T{2}*static_cast<T>(k)+T{1}) * (T{1}+x2));
  }
  template <typename T>
  constexpr T atan_product(T x, int k)
  {
    return k == 1 ? atan_term(x*x, k) :
      atan_term(x*x, k) * atan_product(x, k-1);
  }
  template <typename T>
  constexpr T atan_sum(T x, T sum, int n)
  {
    return sum + atan_product(x, n) == sum ?
      sum :
      atan_sum(x, sum + atan_product(x, n), n+1);
  }
}
template <typename T>
constexpr T atan(T x)
{
  return true ?
    x / (T{1} + x*x) * detail::atan_sum(x, T{1}, 1) :
    throw err::atan_runtime_error;
}

And atan2 follows from atan relatively easily.

It’s big, it’s heavy, it’s wood

Looking over what I’d implemented so far, the next important function (and one that would serve as a building block for a few others) was log. So I programmed it using my friend the Taylor series expansion.

Hm. It turns out that a naive implementation of log is quite slow to converge. (Yes, what a surprise.) OK, a bit of searching yields fruit: “… use Newton’s method to invert the exponentional function … the iteration simplifies to:”

y_{n+1} = y_n + 2 \frac{x - e^{y_n}}{x + e^{y_n}}

“… which has cubic convergence to ln(x).”

Result! I coded it up.

namespace detail
{
  template <typename T>
  constexpr T log_iter(T x, T y)
  {
    return y + T{2} * (x - cx::exp(y)) / (x + cx::exp(y));
  }
  template <typename T>
  constexpr T log(T x, T y)
  {
    return feq(y, log_iter(x, y)) ? y : log(x, log_iter(x, y));
  }
}

It worked well. A week or so later, an article about C++11 constexpr log surfaced serendipitously, and on reading it I realised my implementation wasn’t very well tested, so I added some tests. The tests revealed numerical instability for larger input values, so I took a tip from the article and used a couple of identities to constrain the input:

ln(x) = ln(\frac{x}{e}) + 1
and
ln(x) = ln(e.x) - 1

Coding these expressions to constrain the input proper to detail::log seemed to solve the instability issues.

Are we done yet?

Now I was nearing the end of what I felt was a quorum of functions. To wit:

  • sqrt, cbrt, hypot
  • sin, cos, tan
  • floor, ceil, trunc, round
  • asin, acos, atan, atan2
  • fmod, remainder
  • exp, log, log2, log10

With a decent implementation of exp and log it was easy to add the hyperbolic trig functions and their inverses. The only obviously missing function now was pow. Recall that I already had ipow for raising a floating-point value to an integral power, but I had been puzzling over a more general version of pow to allow raising to a floating-point power for a while. And then I suddenly realised the blindingly obvious:

x^y = (e^{ln(x)})^y = e^{ln(x).y}

And for the general case, pow was solved.

template <typename T>
constexpr T pow(T x, T y)
{
  return true ? exp(log(x)*y) :
    throw err::pow_runtime_error;
}

I had now reached a satisfying conclusion to maths functions and I was getting fairly well-versed in writing constexpr code. Pausing only to implement erf (because why not?), I turned my attention now to other constexpr matters.

Floating-point maths, constexpr style

Tuesday, October 13th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

To ease into constexpr programming I decided to tackle some floating-point maths functions. Disclaimer: I’m not a mathematician and this code has not been rigorously tested for numeric stability or convergence in general. I wrote it more for my own learning than for serious mathematical purposes, and anyway, it’s perhaps likely that constexpr maths functions will be in the standard library sooner or later (and possibly implemented by intrinsics). However, I did subject it to many ad-hoc tests using static_assert. One of the nice things about compile-time computation: the tests are compile-time too, and if it compiles, one can be reasonably sure it works (at least for the values tested)!

Anyway, I went to cppreference.com’s list of common mathematical functions to see what to tackle. Starting with abs is trivial:

Starting with abs is trivial:

template <typename T>
constexpr T abs(T x)
{
  return x >= 0 ? x :
    x < 0 ? -x :
    throw err::abs_runtime_error;
}

For the sake of clarity, I’m omitting some enable_if boilerplate here that ensures that the argument has the right type.

Iterative methods FTW

Now, other functions take a little thought. But remembering calculus, many of them are susceptible to iterative methods, so my general plan formed: formulate a recursive calculation which converges to the real answer, and terminate the recursion when the value is within epsilon of the answer.

For example, to compute sqrt, we can use the well-known Newton-Raphson method, where we start with a guess g0 as an approximation to √x, then:

g_{n+1} = \frac{(g_n + \frac{x}{g_n})}{2}

First, we need a function to determine termination, which I put in a detail namespace:

namespace detail
{
  template <typename T>
  constexpr bool feq(T x, T y)
  {
    return abs(x - y) <=
      std::numeric_limits<T>::epsilon();
  }
}

And then sqrt is straightforward to express (using the value itself as the initial guess), with a driver function handling the domain of the argument and the throw pattern, and delegating the recursion to a function in the detail namespace.

namespace detail
{
  template <typename T>
  constexpr T sqrt(T x, T guess)
  {
    return feq(guess, (guess + x/guess)/T{2}) ? guess :
      sqrt(x, (guess + x/guess)/T{2});
  }
}
template <typename T>
constexpr T sqrt(T x)
{
  return x == 0 ? 0 :
    x > 0 ? detail::sqrt(x, x) :
    throw err::sqrt_domain_error;
}

This ends up being a useful pattern for many functions. For example, cbrt is practically identical; only the equational details change. And once we have sqrt, hypot is trivial of course.

Next I tackled trigonometric functions. Remembering Taylor series and turning to that Oracle of All Things Maths, Wolfram Alpha, we can find the Taylor series expansions:

sin(x) = \sum_{k=0}^{\infty} \frac{(-1)^k x^{1+2k}}{(1+2k)!}
and
cos(x) = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!}

These are basically the same formula, and with some massaging, they can share a common recursive function with slightly different initial constants:

namespace detail
{
  template <typename T>
  constexpr T trig_series(T x, T sum, T n, int i, int s, T t)
  {
    return feq(sum, sum + t*s/n) ?
      sum :
      trig_series(x, sum + t*s/n, n*i*(i+1), i+2, -s, t*x*x);
  }
}
template <typename T>
constexpr T sin(T x)
{
  return true ?
    detail::trig_series(x, x, T{6}, 4, -1, x*x*x) :
    throw err::sin_runtime_error;
}
template <typename T>
constexpr T cos(T x)
{
  return true ?
    detail::trig_series(x, T{1}, T{2}, 3, -1, x*x) :
    throw err::cos_runtime_error;
}

Once we have sin and cos, tan is trivial of course. And exp has almost the same series as sin and cos – easier, actually:

e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}

Which yields in code:

namespace detail
{
  template <typename T>
  constexpr T exp(T x, T sum, T n, int i, T t)
  {
    return feq(sum, sum + t/n) ?
      sum :
      exp(x, sum + t/n, n * i, i+1, t * x);
  }
}
template <typename T>
constexpr T exp(T x)
{
  return true ? detail::exp(x, T{1}, T{1}, 2, x) :
    throw err::exp_runtime_error;
}

No peeking at the floating-point details

Now, how about floor, ceil, trunc and round? They are all variations on a theme: solving one basically means solving all, with only slight differences for things like negative numbers. I thought about this for a half hour – there wasn’t an obvious solution to me at first. We can’t trivially cast to an integral type, because there is no integral type big enough to hold a floating point value in general. And in the name of strict portability, we can’t be sure that the floating point format is IEEE 754 standard (although I think I am prepared to accept this as a limitation). So even if we could fiddle with the representation, portability goes out the window. Finally, constexpr (including C++14 constexpr) forbids reinterpret_cast and other tricks like access through union members.

So how to compute floor? After a little thought I hit upon an approach. In fact, I am relying on the IEEE 754 standard in a small way – the fact that any integer power of 2 is exactly representable as floating point. With that I formed the plan:

  1. Start with a guess of zero.
  2. Start with an increment: 2(std::numeric_limits::max_exponent – 1).
  3. If (guess + increment) is larger than x, halve the increment.
  4. Otherwise, add the increment to the guess, and that’s the new guess.
  5. Stop when the increment is less than 2.

This gives a binary-search like approach to finding the correct value for floor. And the code is easy to write. First we need a function to raise a floating point value to an integer power, in the standard O(log n) way:

namespace detail
{
  template <typename T>
  constexpr T ipow(T x, int n)
  {
    return (n == 0) ? T{1} :
      n == 1 ? x :
      n > 1 ? ((n & 1) ? x * ipow(x, n-1) :
                         ipow(x, n/2) * ipow(x, n/2)) :
      T{1} / ipow(x, -n);
  }
}

Then the actual floor function, which uses ceil for the case of negative numbers:

namespace detail
{
  template <typename T>
  constexpr T floor(T x, T guess, T inc)
  {
    return guess + inc <= x ? floor(x, guess + inc, inc) :
      inc <= T{1} ? guess : floor(x, guess, inc/T{2});
  }
}
constexpr float floor(float x)
{
  return x < 0 ? -ceil(-x) :
    x >= 0 ? detail::floor(
        x, 0.0f,
        detail::ipow(2.0f,
            std::numeric_limits<float>::max_exponent-1)) :
    throw err::floor_runtime_error;
}

The other functions ceil, trunc and round are very similar.

That’s some deep recursion

Now, there is a snag with this plan. It works well enough for float, but when we try it with double, it falls down. Appendix B of the C++ standard (Implementation quantities) recommends that compilers offer a minimum recursive depth of 512 for constexpr function invocations. And on my machine, std::numeric_limits<float>::max_exponent is 128. But for double, it’s 1024. And for long double, it’s 16384. Since by halving the increment each time, we’re basically doing a linear search on the exponent, 512 recursions isn’t enough. I’m prepared to go with what the standard recommends in the sense that I don’t mind massaging switches for compilers that don’t follow the recommendations, but I’d rather not mess around with compiler switches for those that do. So how can I get around this?

Well, I’m using a binary search to cut down the increment. What if I increase the fanout so it’s not a binary search, but a trinary search or more? How big of a fanout do I need? What I need is:

max_recursion_depth > max_exponent/x, where 2x is the fanout

So to deal with double, x = 3 and we need a fanout of 8. An octary search? Well, it’s easy enough to code, even if it is a nested-ternary-operator-from-hell:

template <typename T>
constexpr T floor8(T x, T guess, T inc)
{
  return
    inc < T{8} ? floor(x, guess, inc) :
    guess + inc <= x ? floor8(x, guess + inc, inc) :
    guess + (inc/T{8})*T{7} <= x ? floor8(x, guess + (inc/T{8})*T{7}, inc/T{8}) :
    guess + (inc/T{8})*T{6} <= x ? floor8(x, guess + (inc/T{8})*T{6}, inc/T{8}) :
    guess + (inc/T{8})*T{5} <= x ? floor8(x, guess + (inc/T{8})*T{5}, inc/T{8}) :
    guess + (inc/T{8})*T{4} <= x ? floor8(x, guess + (inc/T{8})*T{4}, inc/T{8}) :
    guess + (inc/T{8})*T{3} <= x ? floor8(x, guess + (inc/T{8})*T{3}, inc/T{8}) :
    guess + (inc/T{8})*T{2} <= x ? floor8(x, guess + (inc/T{8})*T{2}, inc/T{8}) :
    guess + inc/T{8} <= x ? floor8(x, guess + inc/T{8}, inc/T{8}) :
    floor8(x, guess, inc/T{8});
}

(Apologies for the width, but I really can’t make it much more readable.) For the base case, I revert to the regular binary search implementation of floor. What about for long double? Well, since max_exponent is 16384, x = 33 and we need a fanout of 233. That’s not happening! For long double, I have no choice but to revert to the C++14 constexpr rules:

constexpr long double floor(long double x)
{
  if (x < 0.0) return -ceil(-x);
  long double inc = detail::ipow(
      2.0l,
      std::numeric_limits<long double>::max_exponent - 1);
  long double guess = 0.0l;
  for (;;)
  {
    while (guess + inc > x)
    {
      inc /= 2.0l;
      if (inc < 1.0l)
        return guess;
    }
    guess += inc;
  }
  throw err::floor_runtime_error;
}

And that’s enough of such stuff for one blog post. Next time, I’ll go through the rest of the maths functions I implemented, with help from Leonhard Euler and Wikipedia.

Experimenting with constexpr

Tuesday, October 13th, 2015

Since seeing Scott Schurr at C++Now in Aspen and hearing his talks about constexpr, it’s been on my list of things to try out, and recently I got around to it. With the release of Visual Studio 2015, Microsoft’s compiler now finally supports C++11 style constexpr (modulo some minor issues), so it’s a good time to jump in.

Doing things the hard way

Now, I have no expectation that MS will provide for C++14 constexpr any time soon, since (as I understand it) it requires a fairly dramatic rework of the compiler front end. So although C++14’s relaxed constexpr is much easier to use than C++11’s relatively draconian version, I decided to stick as much as possible to C++11 to see what I could do. This basically means:

  • Functions are limited to a single return statement
  • Constructor bodies are empty – everything must be done in the initialization list

But I can do conditionals with the ternary operator, and I can call functions. That means in theory I can do anything! I like functional programming!

Scott’s talk covers three broad application areas of constexpr: floating-point computations, parsing and containers. So I started at the beginning.

I decided to put everything in the cx namespace, and make use of a trick I learned from Scott. One of the snags with constexpr is that it can be at the compiler’s discretion. Let’s say you write a constexpr function, and call it. Unless you use the result in a context that is required at compile time (such as a non-type template argument, array size, or assigned to a constexpr variable), the compiler is free not to do the work at compile time, but instead generate a normal runtime function. And in my experience, compilers aren’t aggressive about doing work at compile time.

From one point of view this is somewhat desirable – or at least, we want constexpr functions to be able to do double duty as compile-time and runtime functions. Well, that’s a stated goal, but as we shall see, writing C++11-friendly constexpr functions sometimes results in formulations that are very different from what we’d usually expect/want at runtime. But let’s assume that I write a nice constexpr function:

constexpr float abs(float x)
{
  return x >= 0 ? x : -x;
}

Now what if I make a mistake using this? What if I call this function and forget to use the result in a constexpr variable? The compiler’s going to be quite happy to emit the function, and I will end up with runtime computation that I don’t want.

Avoiding accidental runtime work

There’s no way I know of at compile time to insist that the compiler stay in constexpr-land. But we can at least turn a runtime problem into a link-time error with a little trick:

namespace err {
  namespace {
    extern const char* abs_runtime_error;
  }
}
 
constexpr float abs(float x)
{
  return x >= 0 ? x :
    x < 0 ? -x :
    throw err::abs_runtime_error;
}

This makes use of a couple of features of constexpr. First, throw is not allowed in constexpr functions (because what could it mean to throw an exception at compile time? Madness). But that doesn’t matter, because constexpr functions are effectively evaluated in a C++ interpreter, so as long as we never trigger the condition that causes evaluation of the throw, we’re OK. This compile-time interpreter also has a lot fewer features than the normal compiler that we’re used to, including (at least at time of writing) things like warnings for unreachable code and conditional expressions always being true, so we can even write things like:

constexpr float myFunc(float x)
{
  return true ? x :
    throw some_error;
}

And the compiler won’t make a sound.

The throw pattern

So what’s actually happening here? We’re getting two forms of protection. First, we can use this to enforce the domain, as we might with a square root function:

namespace err {
  namespace {
    extern const char* sqrt_error;
  }
}
 
constexpr float sqrt(float x)
{
  return x >= 0 ? sqrt_impl(x) :
    throw sqrt_error;
}

If we accidentally pass a negative number to sqrt here, it will try to evaluate throw in a constexpr function and we’ll get a compile error. So that’s nice.

The second, more insidious error that we avoid is the one where we accidentally turn a compile-time function into a runtime one. If we call sqrt without (for example) assigning the result to a constexpr variable, and the compiler emits sqrt as a runtime function, the linker will fail trying to find sqrt_error. A link error isn’t quite as good as a compile error, but it’s a lot better than a runtime problem that might not even be discovered!

The final oddity here arises when you consider that this code might be in a header. Since constexpr functions are (we hope) computed at compile time, they are implicitly inline, so that’s OK. But perhaps the stranger thing is that we have a symbol in an anonymous namespace in a header. This, by the way, is called out in the ISO C++ Core Guidelines as a no-no: “It is almost always a bug to mention an unnamed namespace in a header file.” But in this case, we don’t want the user to be able to define the symbol, and the potential for ODR-violations/multiple definitions isn’t there – the symbol is never supposed to be defined or used. Maybe we’ve found the one reason for that “almost”.

So those are some of the basics of building constexpr machinery. Next: my dive into floating-point maths at compile time.