Archive for the ‘C++’ Category

Development of an Algorithm

Wednesday, June 21st, 2017

Here’s an exercise: given a nice piece of code sitting in a file, how do you take that code and make it generic, in the style of an STL algorithm?

For our example, let’s consider an algorithm that isn’t (yet) in the STL. First, the problem it solves. Imagine that you have a set (in the mathematical sense, not the container sense) of numbers; discontiguous, unordered, and notionally taken from the positive integers, e.g.:

{5, 2, 7, 9, 4, 0, 1}

What we want to find is the smallest number that isn’t included in the set. For the example given above, the answer is 3. If the set is sorted, the solution is apparent; adjacent_find will find the first gap for us. So one solution would be:

// assuming our set is non-empty, in an array or vector, and contains 0
// (if 0 might be absent, check a.front() first)
std::sort(a.begin(), a.end());
auto p = std::adjacent_find(
    a.begin(), a.end(),
    [] (int x, int y) { return y - x > 1; });
int unused = a.back() + 1; // no gap: the next value past the largest
if (p != a.end())
  unused = *p + 1;

The asymptotic complexity of this solution is of course O(n log n): adjacent_find is linear, so the complexity of the sort dominates. Is there a better solution? It’s not obvious at first glance, but there is a linear-time solution to this problem. The key is a divide-and-conquer strategy using that workhorse of algorithms, partition.

The smallest unused number

Let’s start by assuming that the minimum unused value is zero. We’ll choose a pivot value, m, which is the assumed midpoint of the sequence – it doesn’t actually have to be present in the sequence, but we’re going to simply assume that it’s the middle value. Partitioning the sequence about the pivot, we’ll get a situation like this:
[Figure: the sequence partitioned about the pivot m]
where the first value equal to or greater than m is at position p (the iterator returned by the call to partition). If the lower half of the sequence has no gaps, the number of elements below p will be exactly m minus the current minimum unused value, and we can recurse on the top half of the sequence, setting the new minimum unused value to m.

We know that the number of elements below p cannot exceed m minus the minimum unused value, since there cannot be any repetitions in the set. Therefore, if the two are not equal, the count must be smaller – meaning that there is at least one gap below p, and that we can recurse on the bottom half of the sequence (keeping the minimum unused value as-is).

The base case of the algorithm is when the sequence we need to partition is empty. At that point we will have found the minimum unused value. Here’s the algorithm in code, with a bit of setup:

void min_unused()
{
  // initialize RNG
  std::array<int, std::mt19937::state_size> seed_data;
  std::random_device r;
  std::generate_n(seed_data.data(), seed_data.size(), std::ref(r));
  std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
  std::mt19937 gen(seq);
 
  // fill an array with ints, shuffle, discard some
  int a[10];
  std::iota(&a[0], &a[10], 0);
  std::shuffle(&a[0], &a[10], gen);
  int first_idx = 0;
  int last_idx = 7; // arbitrary truncation
 
  for (int i = first_idx; i < last_idx; ++i)
    std::cout << a[i] << '\n';
 
  // the algorithm
  int unused = 0;
  while (first_idx != last_idx) {
    int m = unused + (last_idx - first_idx + 1)/2;
    auto p = std::partition(&a[first_idx], &a[last_idx],
                            [&] (int i) { return i < m; });
    if (p - &a[first_idx] == m - unused) {
      unused = m;
      first_idx = p - &a[0];
    } else {
      last_idx = p - &a[0];
    }
  }
 
  std::cout << "Min unused: " << unused << '\n';
}

You can also see the algorithm on wandbox. I leave it to you to convince yourself that the algorithm works. The asymptotic complexity of the algorithm is linear; at each stage we are running partition, which is O(n), and then recursing, but only on one half of the sequence. Thus, the complexity is:

O(n + n/2 + n/4 + n/8 + ...) = O(2n) = O(n)

From concrete to generic: first steps

So, now we have an algorithm that works… for C-style arrays, and in one place only. Naturally we want to make it as generic as we can; what, then, are the steps to transform it into a decent algorithm that can be used just as easily as those in the STL? Perhaps you’re already thinking that this is overly concrete – but hey, this is a pedagogical device. Bear with me.

First, we can pull the algorithm out into a function. You’ve already seen the setup code, so you should be able to imagine the calling code from here on. While we’re at it, let’s change the index-based arithmetic to pointers. We can pass the sequence in to the function as an argument; for now, we may as well say that the sequence is a vector, because vector is always the right default choice! Our updated function:

// version 1: a function that takes a vector
int min_unused(std::vector<int>& v)
{
  int* first = v.data(); // data() is valid even for an empty vector
  int* last = first + v.size();
  int unused = 0;
  while (first != last) {
    int m = unused + (last - first + 1)/2;
    auto p = std::partition(first, last,
                            [&] (int i) { return i < m; });
    if (p - first == m - unused) {
      unused = m;
      first = p;
    } else {
      last = p;
    }
  }
  return unused;
}

This is already looking a little clearer. Using pointers removes some of the syntactic noise and lets us see more clearly what’s going on. But, as you may already have thought, once we have vectors, the next logical step is to consider using iterators. Let’s make that change now so that we can run the algorithm on a subrange within any vector.

// version 2: use iterators, like a real algorithm
template <typename It>
inline int min_unused(It first, It last)
{
  int unused = 0;
  while (first != last) {
    int m = unused + (last - first + 1)/2;
    auto p = std::partition(first, last,
                            [&] (int i) { return i < m; });
    if (p - first == m - unused) {
      unused = m;
      first = p;
    } else {
      last = p;
    }
  }
  return unused;
}

Now, instead of using the vector directly, we can pass in begin() and end(), or any other valid range-delimiting pair of iterators. Note also that min_unused is marked inline, so we can put it in a header without causing link errors when it’s instantiated the same way in multiple translation units (strictly, a function template doesn’t need this, but it does no harm).

This is a good start, but we can make it more generic yet! At the moment it’s still only working on sequences of ints, so let’s fix that.

// version 3: infer the value_type
template <typename It>
inline typename std::iterator_traits<It>::value_type min_unused(It first, It last)
{
  using T = typename std::iterator_traits<It>::value_type;
  T unused{};
  while (first != last) {
    T m = unused + (last - first + 1)/2;
    auto p = std::partition(first, last,
                            [&] (const T& i) { return i < m; });
    if (p - first == m - unused) {
      unused = m;
      first = p;
    } else {
      last = p;
    }
  }
  return unused;
}

Here we’re using iterator_traits to discover the value_type of the iterators passed in. This works with anything that satisfies the Iterator concept. Consequently we’ve also now removed the ints that were scattered through the code and instead are using Ts. Note that we are value-initializing unused with T{}, which will properly zero-initialize integral types.

Since we aren’t actually calculating the initial value of unused, we need not assume it to be zero; we could let the caller pass it in, and default the argument for convenience. In order to do that, we’ll pull out T into a defaulted template argument.

// version 4: let the user supply the initial minimum
template <typename It,
          typename T = typename std::iterator_traits<It>::value_type>
inline T min_unused(It first, It last, T init = T{})
{
  while (first != last) {
    T m = init + (last - first + 1)/2;
    auto p = std::partition(first, last,
                            [&] (const T& i) { return i < m; });
    if (p - first == m - init) {
      init = m;
      first = p;
    } else {
      last = p;
    }
  }
  return init;
}

A caller-provided minimum might be quite handy – it’s common for zero to be a reserved value, or to have a non-zero starting point for values.

Iterators again

At this point, min_unused is starting to look more like a real STL algorithm, at least superficially, but it’s not quite there yet. The computation of m and the check against p - first use plain iterator arithmetic, which assumes that the iterators passed in support it. A fundamental concept in the STL is the idea of iterator categories: concepts which define the operations available on different kinds of iterators. Not all iterators support random access, and any time we are writing a generic algorithm, we had better use the appropriate functions to manipulate iterators rather than assuming valid operations:

// version 5: handle iterator arithmetic generically
template <typename It,
          typename T = typename std::iterator_traits<It>::value_type>
inline T min_unused(It first, It last, T init = T{})
{
  while (first != last) {
    T m = init + (std::distance(first, last)+1)/2;
    auto p = std::partition(first, last,
                            [&] (const T& i) { return i < m; });
    if (std::distance(first, p) == m - init) {
      init = m;
      first = p;
    } else {
      last = p;
    }
  }
  return init;
}

In the code above, the raw iterator arithmetic has been replaced with calls to distance. This is a key change! Now we are no longer limited to random access iterators like those of vector, or plain pointers. Our algorithm now works on all kinds of iterators – but which ones exactly? It behooves us to define exactly what we require of the iterators being used as our arguments. Until C++ has Concepts, we can’t enforce this, but we can document it, typically by calling the template argument something more descriptive than ‘It’. Looking at the references for distance and partition, we find that a forward iterator is required.

For many algorithms in the STL, there are more efficient versions available for stronger iterator categories. A good example is partition itself: with just a forward iterator, it may do N swaps, but with a bidirectional iterator, it need only perform, at most, N/2 swaps. Wherever we want to use different algorithms for different iterator categories, overloading with tag dispatch is a common technique; the STL provides iterator tag types for use as arguments to these overloads.
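
For reference, a minimal sketch of that pattern (with a hypothetical do_work algorithm, not part of min_unused):

template <typename ForwardIt>
void do_work(ForwardIt first, ForwardIt last, std::forward_iterator_tag)
{
  // general implementation for forward iterators
}

template <typename BidirIt>
void do_work(BidirIt first, BidirIt last, std::bidirectional_iterator_tag)
{
  // more efficient implementation exploiting backward traversal
}

template <typename It>
void do_work(It first, It last)
{
  // the tag object selects the best-matching overload at compile time
  do_work(first, last,
          typename std::iterator_traits<It>::iterator_category{});
}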

But with min_unused, we don’t have any differences in the top level algorithm that would give us greater efficiency based on iterator category. Both distance and partition do that work themselves.

// version 6: document the relaxed iterator category
template <typename ForwardIt,
          typename T = typename std::iterator_traits<ForwardIt>::value_type>
inline T min_unused(ForwardIt first, ForwardIt last, T init = T{})
{
  while (first != last) {
    T m = init + (std::distance(first, last)+1)/2;
    auto p = std::partition(first, last,
                            [&] (const T& i) { return i < m; });
    if (std::distance(first, p) == m - init) {
      init = m;
      first = p;
    } else {
      last = p;
    }
  }
  return init;
}

Strength reduction

Now is also a good time to perform some manual strength reduction on our code, making sure that each operation used is the weakest option to do the job. This doesn’t really apply to any operation instances in min_unused, but in practical terms it typically means avoiding postfix increment and decrement. If the algorithm implements more than trivial mathematics it may also require quite a bit of decomposition to discover all potential efficiencies and opportunities to eliminate operations. For more examples, I recommend watching any and all of Alex Stepanov’s excellent series of video lectures.

We do have one operation in this algorithm that stands out: the division by 2 in the computation of m. Twenty years ago, I might have converted this to a shift, but in my opinion the divide is clearer to read, and these days there is no competitive compiler in the world that won’t do this trivial strength reduction. According to my experiments with the Compiler Explorer, MSVC and GCC do it even without optimization.

STL-ready?

So there we have it. Our algorithm is looking pretty good now by my reckoning – appropriately generic and, overall, a good addition to my toolkit.

Remaining are just a couple of final considerations. The first is ranges: the Ranges TS will change the STL massively, providing for new algorithms and changes to existing ones. A specific consequence of ranges is that begin and end iterators need not be the same type any more; how that may affect this algorithm I leave as an exercise to the reader.

Even more generic?

The other consideration takes us further still down the path of making our algorithm as generic as possible. A useful technique for discovering the constraints on types is to print out the code in question and use markers to highlight variables that interact with each other in the same colour. This can reveal which variables interact as well as which ones are entirely disjoint in usage – a big help in determining how they should be made into template parameters. In general, examining how variables interact can be a fruitful exercise.

Thinking about how the variables are used in our case, we see that we need the ability to add the result of distance to a T (when computing m), and that the difference between two Ts must be comparable with the result of distance (when checking the partition point). We are used to thinking about addition and subtraction on integers, where they are closed operations (i.e. adding two integers or subtracting two integers gives you another integer), but there is another possibility. And in fact, there is already a good example of it in the STL.

Arithmetic the chrono way

It is possible for the type produced by subtraction on Ts to be something other than T. This is what happens with chrono, with time_point and duration. Subtracting two time_points yields a duration. It is meaningless – and, therefore, illegal – to add two time_points, but durations may be added both to themselves and to time_points.
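
A few lines make the rules concrete:

// time_point/duration arithmetic: what compiles and what doesn't
auto t1 = std::chrono::system_clock::now();
auto t2 = t1 + std::chrono::seconds{5}; // time_point + duration: OK
auto d  = t2 - t1;                      // time_point - time_point: a duration
// auto bad = t1 + t2;                  // ill-formed: adding two time_points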

In mathematical language, chrono models a one-dimensional affine space. (Thanks to Gašper Ažman for that particular insight!)

This is also the case with min_unused: T is our time_point equivalent, and the result of distance is our duration equivalent, suggesting that we can pull out the difference type as a template parameter:

template <typename ForwardIt,
          typename T = typename std::iterator_traits<ForwardIt>::value_type,
          typename DiffT = typename std::iterator_traits<ForwardIt>::difference_type>
inline T min_unused(ForwardIt first, ForwardIt last, T init = T{})
{
  while (first != last) {
    T m = init + DiffT{(std::distance(first, last)+1)/2};
    auto p = std::partition(first, last,
                            [&] (const T& i) { return i < m; });
    if (DiffT{std::distance(first, p)} == m - init) {
      init = m;
      first = p;
    } else {
      last = p;
    }
  }
  return init;
}

The advantage here is that we can now use min_unused to deal with any code that has different value and difference types, like chrono:

// a vector of time_points, 1s apart
// (assumes using namespace std::chrono_literals is in effect, for 1s)
constexpr auto VECTOR_SIZE = 10;
std::vector<std::chrono::system_clock::time_point> v;
auto start_time = std::chrono::system_clock::time_point{};
std::generate_n(std::back_inserter(v), VECTOR_SIZE,
                [&] () {
                  auto t = start_time;
                  start_time += 1s;
                  return t;
                });
std::shuffle(v.begin(), v.end(), gen);
v.resize(3*VECTOR_SIZE/4);
 
for (auto i : v)
  std::cout
    << std::chrono::duration_cast<std::chrono::seconds>(i.time_since_epoch()).count()
    << '\n';
 
// we can now find the minimum time_point not present
auto u = min_unused<
  decltype(v.begin()), decltype(start_time), std::chrono::seconds>(
      v.begin(), v.end());
 
std::cout << "Min unused: "
     << std::chrono::duration_cast<std::chrono::seconds>(u.time_since_epoch()).count()
     << '\n';

See this code on wandbox.

This concludes our exercise in developing a generic algorithm – at least, for now. Good luck with your own experiments!

C++17 Class Templates: Deduced or Not?

Thursday, June 15th, 2017

C++17 introduces class template deduction: a way for the compiler to deduce the template arguments of a class template from a constructor call, without our having to write a make_* function. But it’s not quite as straightforward as it seems.

Imagine we have a simple type that will tell us when it’s copied or moved, just for testing.

struct S
{
  S() = default;
  S(const S&) { std::cout << "copy\n"; }
  S(S&&) { std::cout << "move\n"; }
};

And likewise a very simple template like so:

template <typename T>
struct Foo
{
  Foo(const T& t_) : t(t_) {}
  Foo(T&& t_) : t(std::move(t_)) {}
 
private:
  T t;
};

Note that we provide two forms of constructor to deal with both rvalues and lvalues being passed in. With C++14, prior to class template deduction, we would write a make_foo function to instantiate this template something like this:

template <typename T>
auto make_foo(T&& t)
{
  return Foo<std::decay_t<T>>(std::forward<T>(t));
}

And call it like so:

int main()
{
  // rvalue: move
  auto f1 = make_foo(S());
 
  // lvalue: copy
  S s;
  auto f2 = make_foo(s);
}

The important thing here is that the template argument to make_foo is deduced, and the template argument to Foo is not (cannot be, in C++14). Furthermore, because the template argument to make_foo is deduced, make_foo‘s argument is a forwarding reference rather than an rvalue reference, and hence we use perfect forwarding to pass it along to the appropriate constructor for Foo.

So far so good. Now, with C++17, class template deduction is available to us, so we can get rid of make_foo. Now our main() function looks like this:

int main()
{
  // rvalue: move?
  Foo f3{S()};
 
  // lvalue: copy?
  S s;
  Foo f4{s};
}

Here’s the unintuitive part: the template arguments are being deduced now. But in that case, doesn’t that mean that Foo‘s constructor argument is being deduced? Which means that what looks like Foo‘s rvalue constructor is actually now a constructor with a forwarding reference that will outcompete the other constructor? If so, that would be bad – we could end up moving from an lvalue!

I woke up the other morning with this worrying thought and had to investigate. The good news is that even though the code looks like it would break, it doesn’t.

So in fact, although Foo‘s template argument is being deduced, the crucial thing is that Foo‘s constructor still takes an rvalue reference – not a forwarding reference – at the point of declaration. According to 16.3.1.8 [over.match.class.deduct], the compiler forms an overload set of function templates that match the signatures of the available constructors, and it uses the correct types. In other words, it’s doing something that we could not do ourselves: forming a function template, for the purposes of deduction, whose argument is an rvalue reference rather than a forwarding reference.
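
A rough way to picture that overload set (hypothetical names; we couldn’t actually write this ourselves, because in handwritten code the T&& below would become a forwarding reference):

// what the compiler notionally forms from Foo's constructors for deduction
template <typename T> Foo<T> deduce_foo(const T&); // from Foo(const T&)
template <typename T> Foo<T> deduce_foo(T&&);      // from Foo(T&&) - and by
                                                   // the special rule, this
                                                   // stays an rvalue reference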

As is often the case in C++, one needs to be careful to properly distinguish things. It is very easy to get confused over rvalue references and forwarding references, because they look the same. The difference is that forwarding references must be deduced… and in this case, even though it looks like it’s deduced, it isn’t.

Edit: Reddit user mps1729 points out that indeed, neither of the implicitly generated functions is using a forwarding reference, as clarified in 17.8.2.1/3 [temp.deduct.call]. Thanks for the clarification!

CHRONO + RANDOM = ?

Monday, October 24th, 2016

Being a quick sketch combining <chrono> and <random> functionality, with cryptarithmetic interludes…

At CppCon this year there were several good talks about randomness and time calculations in C++. On randomness: Walter Brown’s What C++ Programmers Need to Know About Header <random> and Cheinan Marks’ I Just Wanted a Random Integer! were both excellent talks. And Howard Hinnant gave several great talks: A <chrono> Tutorial, and Welcome to the Time Zone, a followup to his talk from last year, A C++ Approach to Dates and Times.

CHRONO + RANDOM = HORRID ?

That’s perhaps a little unfair, but recently I ran into the need to compute a random period of time. I think this is a common use case for things like backoff schemes for network retransmission. And it seemed to me that the interaction of <chrono> and <random> was not quite as good as it could be:

system_clock::duration minTime = 0s;
system_clock::duration maxTime = 5s;
uniform_int_distribution<> d(minTime.count(), maxTime.count());
// 'gen' here is a Mersenne twister engine
auto nextTransmissionWindow = system_clock::duration(d(gen));

This code gets more complex when you start computing an exponential backoff. It’s relatively straightforward, but clumsy, especially if you want a floating-point base for your exponent calculation: system_clock::duration has an integral representation, so in all likelihood you end up having to cast multiple times, using either static_cast or duration_cast. That’s a bit messy.
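
For instance, a sketch of the kind of casting dance I mean (attempt being a hypothetical retry count):

using namespace std::chrono;
int attempt = 3;                             // hypothetical retry count
duration<double> base{1.0};                  // one second, floating-point rep
auto scaled = base * std::pow(2.0, attempt); // still duration<double>
auto backoff = duration_cast<system_clock::duration>(scaled);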

I remembered some code from another talk: Andy Bond’s AAAARGH!? Adopting Almost Always Auto Reinforces Good Habits!? in which he presented a function to make a uniform distribution by inferring its argument type, useful in generic code. Something like the following:

template <typename A, typename B = A,
          typename C = std::common_type_t<A, B>,
          typename D = std::uniform_int_distribution<C>>
inline auto make_uniform_distribution(const A& a,
                                      const B& b = std::numeric_limits<B>::max())
  -> std::enable_if_t<std::is_integral<C>::value, D>
{
  return D(a, b);
}

Of course, the standard also provides uniform_real_distribution, so we can provide another template and overload the function for real numbers:

template <typename A, typename B = A,
          typename C = std::common_type_t<A, B>,
          typename D = std::uniform_real_distribution<C>>
inline auto make_uniform_distribution(const A& a,
                                      const B& b = B{1})
  -> std::enable_if_t<std::is_floating_point<C>::value, D>
{
  return D(a, b);
}

And with these two in hand, it’s easy to write a uniform_duration_distribution that uses the correct distribution for its underlying representation (using a home-made type trait to constrain it to duration types).

template <typename T>
struct is_duration : std::false_type {};
template <typename Rep, typename Period>
struct is_duration<std::chrono::duration<Rep, Period>> : std::true_type {};
 
template <typename Duration = std::chrono::system_clock::duration,
          typename = std::enable_if_t<is_duration<Duration>::value>>
class uniform_duration_distribution
{
public:
  using result_type = Duration;
 
  explicit uniform_duration_distribution(
      const Duration& a = Duration::zero(),
      const Duration& b = Duration::max())
    : m_a(a), m_b(b)
  {}
 
  void reset() {}
 
  template <typename Generator>
  result_type operator()(Generator& g)
  {
    auto d = make_uniform_distribution(m_a.count(), m_b.count());
    return result_type(d(g));
  }
 
  result_type a() const { return m_a; }
  result_type b() const { return m_b; }
  result_type min() const { return m_a; }
  result_type max() const { return m_b; }
 
private:
  result_type m_a;
  result_type m_b;
};

Having written this, we can once again overload make_uniform_distribution to provide for duration types:

template <typename A, typename B = A,
          typename C = std::common_type_t<A, B>,
          typename D = uniform_duration_distribution<C>>
inline auto make_uniform_distribution(const A& a,
                                      const B& b = B::max()) -> D
{
  return D(a, b);
}

And now we can compute a random duration more expressively and tersely and, I think, in the spirit of the functionality that already exists in <chrono> for manipulating durations.

auto d = make_uniform_distribution(0s, 5000ms);
auto nextTransmissionWindow = d(gen);

CHRONO + RANDOM = DREAMY

I leave it as an exercise for the reader to solve these cryptarithmetic puzzles. As for the casting problems, for now, I’m living with them.

An algorithmic sketch: inplace_merge

Sunday, March 13th, 2016

One of the things I like to do in my spare time is study the STL algorithms. It is easy to take them for granted and easy, perhaps, to imagine that they are mostly trivial. And some are: I would think that any decent interview candidate ought to be able to write find correctly. (Although even such a trivial-looking algorithm is based on a thorough understanding of iterator categories.)

Uncovering the overlooked

But there are some algorithms that are non-trivial, and some are important building blocks. Take inplace_merge. For brevity, let’s consider the version that just uses operator< rather than being parametrized on the comparison. The one easily generalizes to the other in a way that is not important to the algorithm itself.

template <typename BidirectionalIterator>
void inplace_merge(BidirectionalIterator first,
                   BidirectionalIterator middle,
                   BidirectionalIterator last);

It merges two consecutive sorted ranges into one sorted range. That is, if we have an input like this:

x_0 x_1 x_2 ... x_n y_0 y_1 y_2 ... y_m

where x_i <= x_{i+1} for all i < n, and y_j <= y_{j+1} for all j < m.

We get this result occupying the same space:

r_0 r_1 r_2 ... r_{n+m}

where r_i <= r_{i+1} for all i < n+m, and the new range is a permutation of the original ranges. In addition, the standard states a few additional constraints:

  • inplace_merge is stable - that is, the relative order of equivalent elements is preserved
  • it uses a BidirectionalIterator which shall be ValueSwappable and whose dereferent (is that a word?) shall be MoveConstructible and MoveAssignable
  • when enough additional memory is available, it uses exactly (last - first) - 1 comparisons; otherwise an algorithm with O(N log N) complexity (where N = last - first) may be used
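
For concreteness, a minimal example of inplace_merge at work:

// merging two consecutive sorted ranges in place
std::vector<int> v{1, 3, 5, 0, 2, 4};
std::inplace_merge(v.begin(), v.begin() + 3, v.end());
// v is now {0, 1, 2, 3, 4, 5}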

Avenues of enquiry

Leaving aside the possible surprise of discovering that an STL algorithm may allocate memory, some thoughts spring to mind immediately:

  • Why does inplace_merge need a BidirectionalIterator?
  • How much memory is required to achieve O(n) performance? Is a constant amount enough?

And to a lesser extent perhaps:

  • Why are merge and inplace_merge not named the same way as other algorithms, where the more normal nomenclature might be merge_copy and merge?
  • What is it with the algorists' weasel-word "in-place"?

First thoughts about the algorithm

It seems that an O(n log n) algorithm should be possible on average, because in the general case, simply sorting the entire range produces the desired output. Although the sort has to be stable, which means something like merge sort, which leads us down a recursive rabbit hole. Hm.

At any rate, it's easy to see how to achieve inplace_merge with no extra memory needed by walking iterators through the ranges:

template <typename ForwardIt>
void naive_inplace_merge(
    ForwardIt first, ForwardIt middle, ForwardIt last)
{
  while (first != middle && middle != last) {
    if (*middle < *first) {
      std::iter_swap(middle, first);
      auto i = middle;
      std::rotate(++first, i, ++middle);
    } else {
      ++first;
    }
  }
}

After swapping (say) x_0 and y_0, the ranges look like this:

y_0 x_1 x_2 ... x_n x_0 y_1 y_2 ... y_m

And the call to rotate fixes up x_1 ... x_n x_0 to be ordered again. From there we proceed as before on the ranges x_0 ... x_n and y_1 ... y_m.

This algorithm actually conforms to the standard! It has O(n) comparisons, uses no extra memory, and has the advantage that it works on ForwardIterator! But unfortunately, it's O(n²) overall in operations, because of course, rotate is O(n). So how can we do better?

Using a temporary buffer

If we have a temporary buffer available that is equal in size to the smaller of the two ranges, we can move the smaller range to it, move the other range up if necessary, and perform a "normal" merge of the two ranges into the original space:

template <typename BidirIt>
void naive_inplace_merge2(
    BidirIt first, BidirIt middle, BidirIt last)
{
  using T = typename std::iterator_traits<BidirIt>::value_type;
 
  auto d1 = std::distance(first, middle);
  auto d2 = std::distance(middle, last);
 
  auto n = std::min(d1, d2);
  auto tmp = std::make_unique<char[]>(n * sizeof(T));
  T* begint = reinterpret_cast<T*>(tmp.get());
  T* endt = begint + n;
 
  if (d1 <= d2)
  {
    std::move(first, middle, begint);
    std::merge(begint, endt, middle, last, first);
  }
  else
  {
    std::move(middle, last, begint);
    auto i = std::move_backward(first, middle, last);
    std::merge(i, last, begint, endt, first);
  }
}

This is essentially the algorithm used by STL implementations if buffer space is available. And this is the reason why inplace_merge requires BidirectionalIterator: because move_backward does.

(This isn't quite optimal: the std::move_backward can be mitigated with reverse iterators and predicate negation, but the BidirectionalIterator requirement remains. Also, strictly speaking, std::merge is undefined behaviour here because one of the input ranges overlaps the output range, but we know the equivalent loop is algorithmically safe.)

Provisioning of the temporary buffer is also a little involved because we don't know that elements in the range are default constructible (and perhaps we wouldn't want to default-construct our temporaries anyway). So to deal correctly with non-trivial types here, std::move should actually be a loop move-constructing values. And when std::inplace_merge is used as a building block for e.g. std::stable_sort, it would also be nice to minimize buffer allocation rather than having an allocation per call. Go look at your favourite STL implementation for more details.
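
By way of illustration, the first std::move above would become something like this sketch (a real implementation is more careful still, e.g. about exception safety and about aligning the buffer suitably for T):

// move-construct each element into the raw buffer with placement new,
// rather than move-assigning into objects that don't yet exist
T* out = begint;
for (auto it = first; it != middle; ++it, ++out)
  ::new (static_cast<void*>(out)) T(std::move(*it));
// ...and remember to destroy the temporaries once they're merged back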

Thinking further

The literature contains more advanced algorithms for merging if a suitably-sized buffer is not available: the basis for the STL's choice is covered in Elements of Programming chapter 11, and in general the work of Dudzinski & Dydek and of Kim & Kutzner seems to be cited a lot.

But I knew nothing of this research before tackling the problem and attempting to solve it requiring just ForwardIterator.

I spent a couple of evenings playing with how to do inplace_merge. I covered a dozen or more A4 sheets of squared paper with diagrams of algorithms evolving. I highly recommend this approach! After a few hours of drawing and hacking I had a really good idea of the shape of things. Property-based testing came in very handy for breaking my attempts, and eventually led me to believe that a general solution on the lines I was pursuing would either involve keeping track of extra iterators or equivalently require extra space. Keeping track of iterators seemed a messy approach, so an extra space approach is warranted.

How much extra space? Consider the "worst case":

x_0 x_1 x_2 ... x_n y_0 y_1 y_2 ... y_m

Assume for the moment that m <= n. When y_m < x_0, we need extra space to hold all of x_0 ... x_m. If n <= m then we will need extra space for x_0 ... x_n to likewise move them out of the way. Either way, the number of units of extra space we need is min(n, m).

As we move elements of x into temporary storage, we can see that in general at each stage of the algorithm we will have a situation something like this (using Z to mean a moved-from value):

... x_i ... x_n Z_0 ... Z_{j-1} y_j ... y_m

With some values of x moved into temporary storage:

x_{i-t} ... x_{i-1}

The temporary storage here is a queue: we always push on to the end and pop from the beginning, since the elements in it start, and remain, ordered. Since we know an upper bound on the number of things in the queue at any one time, it can be a ring buffer (recently proposed) over a regular array of values.

Sketching the start

From this, we can start sketching out an algorithm:

  1. Allocate a buffer of size min(m, n) - call it tmp
  2. We'll walk the iterators along the x (first) and y (middle) ranges
  3. The output iterator o will start at first
  4. The next x to consider will either be in-place in the x range, or (if tmp is not empty) in tmp - call it xlow
  5. If *y < *xlow move *x to tmp, move *y to o, inc y
  6. Otherwise, if *xlow is in tmp, move *x to tmp and *xlow from tmp to o
  7. inc o, inc x
  8. if y < last and o < middle, goto 4
  9. deal with the end(s) of the ranges

Dealing with the end

This gets us as far as exhausting the smaller range: after this, we will be in one of two situations.

Situation 1. If we have exhausted the y range, things look like this:

... x_i ... x_n Z_0 ... Z_m

With values of x in temporary storage:

x_{i-t} ... x_{i-1}

To fix up this situation, we can repeatedly swap the tmp range with the equivalent x range until we reach middle (i.e. Z_0), and then simply move the remaining tmp values into place.

I originally wrote a loop repeatedly swapping the values in tmp right up to the end; but I realised that would involve swapping a moved-from object, which would be wrong (it might work… until it doesn’t). Moved-from objects should either be destroyed or made whole (assigned to); nothing else.

Situation 2. The other possibility is that we have exhausted the x range, in which case things look like this:

... Z_0 ... Z_{n-i} y_j ... y_m

With values of x in temporary storage:

x_i ... x_n

To fix up this situation, we can just do a regular merge on the remaining y range and tmp, outputting starting at middle (i.e. Z_0). (With the same proviso as before about undefined behaviour with overlapping ranges.) We know that it will be safe to do a loop equivalent to merge, because we have exactly the required amount of space before y_j to fit x_i ... x_n. This is the same as the STL’s normal buffered merge strategy.
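
Putting the pieces together, here is a minimal sketch of the whole algorithm (a hypothetical forward_inplace_merge; it uses std::deque as the temporary queue for clarity, where the real thing would use a fixed-size ring buffer, and assumes <deque>, <iterator> and <utility>):

template <typename ForwardIt>
void forward_inplace_merge(ForwardIt first, ForwardIt middle, ForwardIt last)
{
  using T = typename std::iterator_traits<ForwardIt>::value_type;
  std::deque<T> tmp; // displaced x values, kept in order (a queue)
  ForwardIt o = first;
  ForwardIt y = middle;
  // main loop: write the smallest remaining element to o each time
  while (y != last && o != middle) {
    if (!tmp.empty()) {
      if (*y < tmp.front()) {
        tmp.push_back(std::move(*o)); *o = std::move(*y); ++y;
      } else {
        tmp.push_back(std::move(*o));
        *o = std::move(tmp.front()); tmp.pop_front();
      }
    } else if (*y < *o) {
      tmp.push_back(std::move(*o)); *o = std::move(*y); ++y;
    } // else *o is already the right element in place
    ++o;
  }
  if (tmp.empty()) return; // everything already in place
  if (o == middle) {
    // situation 2: merge tmp with the remaining y range
    while (!tmp.empty() && y != last) {
      if (*y < tmp.front()) { *o = std::move(*y); ++y; }
      else { *o = std::move(tmp.front()); tmp.pop_front(); }
      ++o;
    }
  } else {
    // situation 1: swap tmp through the remaining x range
    while (o != middle) {
      tmp.push_back(std::move(*o));
      *o = std::move(tmp.front()); tmp.pop_front();
      ++o;
    }
  }
  // pour out whatever is left in tmp
  for (; !tmp.empty(); tmp.pop_front(), ++o)
    *o = std::move(tmp.front());
}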

Final thoughts

I tackled this exercise from scratch, knowing nothing about actual implementations of inplace_merge. This algorithm does some extra housekeeping under the hood, but:

  • it does the minimum number of comparisons
  • each element is moved at most twice: into tmp and out again
  • it needs only ForwardIterator

Optimization and benchmarking under differing conditions of range size, comparison and move cost is left as an exercise to the reader…

I cannot recommend Elements of Programming enough. I am partway through reading it; after this exercise I skipped to chapter 11 to see what it said. Every time I dive into the STL algorithms, I am re-impressed by the genius of Alex Stepanov: Paul McJones’ recent talk The Concept of Concept explains this well, in particular the key role of concepts in the STL in service of algorithmic purity. Alex knew about concepts from the start: it’s taken C++ over 2 decades to catch up.

After doing this work, I discovered a recent proposal that calls for weakening the iterator categories of inplace_merge and related algorithms.

An implementation of this algorithm is on github. It’s just a sketch, written for clarity rather than optimality. This has been a fun exercise.

Lameness Explained

Wednesday, December 9th, 2015

OK, more than one person wanted explanations of The C++ <random> Lame List, so here are some of my thoughts, if only to save people searching elsewhere.

  1. Calling rand() is lame because it’s an LCG with horrible randomness properties, and we can do better. And if you’re not calling rand(), there’s no reason to call srand().
  2. Using time(NULL) to seed your RNG is lame because it doesn’t have enough entropy. It’s only at a second resolution, so in particular, starting multiple processes (e.g. bringing up a bunch of servers) at the same time is likely to seed them all the same.
  3. No, rand() isn’t good enough even for simple uses, and it’s easy to do the right thing these days. The lower order bits of rand()‘s output are particularly non-random, and odds are that if you’re using rand() you’re also using % to get a number in the right range. See item 6.
  4. In C++14 random_shuffle() is deprecated, and it’s removed in C++17, which ought to be reason enough. If you need more reason, one version of it is inconvenient to use properly (it uses a Fisher-Yates/Knuth shuffle so takes an RNG that has to return random results in a shifting range) and the other version of it can use rand() under the hood. See item 1.
  5. default_random_engine is implementation-defined, but in practice it’s going to be one of the standard generators, so why not just be explicit and cross-platform-safe (hint: item 10)? Microsoft’s default is good, but libc++ and libstdc++ both use LCGs as their default at the moment. So not much better than rand().
  6. It is overwhelmingly likely that whatever RNG you use, it will output something in a power-of-two range. Using % to get this into the right range probably introduces bias. Re item 3, consider a canonical simple use: rolling a d6. No power of two is divisible by 6, so inevitably, % will bias the result. Use a distribution instead. STL (and others) have poured a lot of time into making sure they aren’t biased.
  7. random_device is standard, easy to use, and should be high quality randomness. It may not be very well-performing, which is why you probably want to use it for seeding only. But you do want to use it (mod item 8).
  8. Just know your platform. It might be fine in desktop-land, but random_device isn’t always great. It’s supposed to be nondeterministic and hardware based if that’s available… trust but verify, as they say.
  9. Not handling exceptions is lame. And will bite you. I know this from experience with random_device specifically.
  10. The Mersenne twisters are simply the best randomness currently available in the standard.
  11. Putting mt19937 on the stack: a) it’s large (~2.5k) and b) you’re going to be initializing it each time through. So probably not the best. See item 17 for an alternative.
  12. You’re just throwing away entropy if you don’t seed the generator’s entire state. (This is very common, though.)
  13. Simply, uniform_int_distribution works on a closed interval (as it must – otherwise it couldn’t produce the maximum representable value for the given integral type). If you forget this, it’s a bug in your code – and maybe one that takes a while to get noticed. Not good.
  14. Forgetting ref() around your generator means you’re copying the state, which means you’re probably not advancing the generator like you thought you were.
  15. seed_seq is designed to seed RNGs, it’s that simple. It tries to protect against poor-quality data from random_device or whatever variable-quality source of entropy you have.
  16. Not considering thread safety is always lame. Threads have been a thing for quite a while now.
  17. thread_local is an easy way to get “free” thread safety for your generators.
  18. You should be using a Mersenne twister (item 10) so just use the right thing for max(). Job done.
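
Pulling several of these together (items 7, 9, 10, 12, 13, 15 and 17), a sketch of the non-lame everyday case (roll_d6 being a hypothetical example):

#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <iterator>
#include <random>

int roll_d6()
{
  // fully-seeded, thread_local Mersenne twister (items 10, 12, 15, 17)
  thread_local std::mt19937 gen = [] {
    std::array<std::uint32_t, std::mt19937::state_size> seed_data;
    std::random_device rd; // construction/use may throw (item 9)
    std::generate_n(seed_data.data(), seed_data.size(), std::ref(rd));
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
    return std::mt19937(seq);
  }();
  // a distribution over the closed interval [1, 6], not % (items 6, 13)
  std::uniform_int_distribution<int> d(1, 6);
  return d(gen);
}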

If you want more, see rand() Considered Harmful (a talk by Stephan T Lavavej), or The bell has tolled for rand() (from the Explicit C++ blog), or see Melissa O’Neill’s Reddit thread, her talk on PCG, and the associated website.

And of course, cppreference.com.

The C++ <random> Lame List

Monday, December 7th, 2015

Network programmers of a certain age may remember the Windows Sockets Lame List.

I previously wrote a short “don’t-do-that-do-this” guide for modern C++ randomness, and I was recently reading another Reddit exchange featuring STL, author of many parts of Microsoft’s STL implementation, when it struck me that use of C++ <random> needs its own lame list to discourage using the old and busted C parts and encourage using the new C++ hotness. So here, in no particular order, and with apologies to Keith Moore (wherever he may be), is an incomplete lame list for use of <random>.

  1. Calling rand() or srand(). Lame.
  2. Using time(NULL) to seed an RNG. Inexcusably lame.
  3. Claiming, “But rand() is good enough for simple uses!” Dog lame.
  4. Using random_shuffle() to permute a container. Mired in a sweaty mass of lameness.
  5. Using default_random_engine. Nauseatingly lame.
  6. Using % to get a random value in a range. Lame. Lame. Lame. Lame. Lame.
  7. Not using random_device to seed an RNG. Violently lame.
  8. Assuming that random_device is going to do the right thing on your platform. Uncontrollably lame.
  9. Failing to handle a possible exception from the construction or use of random_device. Totally lame.
  10. Using anything in the standard but mt19937 or mt19937_64 as a generator. Intensely lame.
  11. Putting mt19937 on the stack. In all my years of observing lameness, I have seldom seen something this lame.
  12. Seeding mt19937 with only one 32-bit word rather than its full state_size. Pushing the lameness envelope.
  13. Forgetting that uniform_int_distribution works on a closed interval. Thrashing in a sea of lameness.
  14. Passing random_device or a generator to generate_n() by value, forgetting to wrap it with ref(). Glaringly lame.
  15. Failing to use seed_seq to initialize a generator’s state properly. Indescribably lame.
  16. Not considering thread safety when using a generator. Floundering in an endless desert of lameness.
  17. Using a global generator without making it thread_local. Suffocating in self lameness.
  18. Using RAND_MAX instead of mt19937::max(). Perilously teetering on the edge of a vast chasm of lameness.

This list will undoubtedly grow as I continue to write lame code…

Compile-time RNG tricks

Thursday, October 15th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

Compile-time random number generation is quite useful. For instance, we could generate GUIDs (version 4 UUIDs):

namespace cx
{
  struct guid_t
  {
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint64_t data4;
  };
 
  template <uint64_t S>
  constexpr guid_t guidgen()
  {
    return guid_t {
        cx::pcg32<S>(),
        cx::pcg32<S>() >> 16,
        0x4000 | cx::pcg32<S>() >> 20,
        (uint64_t{8 + (cx::pcg32<S>() >> 30)} << 60)
        | uint64_t{cx::pcg32<S>() & 0x0fffffff} << 32
        | uint64_t{cx::pcg32<S>()} };
  }
}
 
#define cx_guid \
  cx::guidgen<cx::fnv1(__FILE__ __DATE__ __TIME__) + __LINE__>
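
Usage is then a one-liner, and each file/line/build combination gets its own value:

// a fresh GUID per file, line and build, computed entirely at compile time
constexpr cx::guid_t my_guid = cx_guid();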

There are situations right now where one might have scripts (probably Python or some such language) in one’s build chain that do this type of generation of C++ code, for example to introduce randomness to each build in the name of security, or simply to mark the build uniquely. It would be nice not to have to maintain extra non-C++ code and extra steps in the build chain.

Is PCG32 secure? Maybe secure enough…

Security you say? We have a source of random numbers that changes every build… we could transparently “encrypt” string literals and decode them at point of use. Might be useful to someone. To wrangle with a string in a constexpr manner, I need to treat it as an array, and then encrypt (I’ll just xor) each character with my random byte stream. I can figure out how long a string literal is at compile time, store the random seed as a template parameter, and use an index_sequence expansion to do the encryption.

template <uint64_t S>
constexpr char encrypt_at(const char* s, size_t idx)
{
  return s[idx] ^
    static_cast<char>(pcg32_output(
      pcg32_advance(S, idx+1)) >> 24);
}
 
template <size_t N>
struct char_array
{
  char data[N];
};
 
template <uint64_t S, size_t ...Is>
constexpr char_array<sizeof...(Is)> encrypt(
    const char *s, std::index_sequence<Is...>)
{
  return {{ encrypt_at<S>(s, Is)... }};
}

That takes care of constructing an encrypted array from a string literal. The rest is just wrapping it up in an encrypted_string class and providing a sane runtime decryption function (we could use the existing C++11 constexpr functions, but they are strangely formulated for runtime use – maybe a more natural formulation would be easier to optimize). And we can give the class a conversion to string.

inline std::string decrypt(uint64_t S, const char* s, size_t n)
{
  std::string ret;
  ret.reserve(n);
  for (size_t i = 0; i < n; ++i)
  {
    S = pcg32_advance(S);
    ret.push_back(s[i] ^
      static_cast<char>(pcg32_output(S) >> 24));
  }
  return ret;
}
 
template <uint64_t S, size_t N>
class encrypted_string
{
public:
  constexpr encrypted_string(const char(&a)[N])
    : m_enc(encrypt<S>(a, std::make_index_sequence<N-1>()))
  {}
 
  constexpr size_t size() const { return N-1; }
 
  operator std::string() const
  {
    return decrypt(S, m_enc.data, N-1);
  }
 
private:
  const char_array<N-1> m_enc;
};
 
template <uint64_t S, size_t N>
constexpr encrypted_string<S, N> make_encrypted_string(
    const char(&s)[N])
{
  return encrypted_string<S, N>(s);
}
 
#define CX_ENCSTR_RNGSEED \
  uint64_t{cx::fnv1(__FILE__ __DATE__ __TIME__) + __LINE__}
 
#define cx_make_encrypted_string \
  cx::make_encrypted_string<CX_ENCSTR_RNGSEED>

The more observant and pedantic among you may have noticed that strictly, I’ve strayed into C++14 territory here, using std::index_sequence. But I think I’m still in the spirit of C++11 constexpr. And more importantly, VS2015 supports std::integer_sequence.

Anyway, let’s exercise this code and check it does the right thing:

int main(int, char* [])
{
  constexpr auto enc =
    cx_make_encrypted_string("I accidentally the string");
  cout << string(enc) << endl;
  return 0;
}

Here’s the result:

$ ./test_constexpr
I accidentally the string
$

And the plaintext string doesn’t appear in the binary:

$ objdump -s -j .rodata ./test_constexpr
 
./test_constexpr:     file format elf64-x86-64
 
Contents of section .rodata:
 404700 01000200 d31e337d 58244d8a 48e52178  ......3}X$M.H.!x
 404710 139d483c 5113d3f1 0aec79a3 80626173  ..H<Q.....y..bas
 404720 69635f73 7472696e 6700               ic_string.

Not bad. Per-compile obfuscation on string literals, with zero memory overhead and only a modest cost at the point of use.

Compile-time counters, revisited

Wednesday, October 14th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

Some time ago I read a blog post showing how to make a compile-time counter: a constexpr function that would return monotonically increasing integers. When I first read it I didn’t really take the time to understand it fully, but now that I was on a compile-time computation kick, I decided to grok it fully.

Without getting too far into the nitty-gritty (go read the other blog post if you’re interested), the technique relies on using template instantiations to “bring-to-life” functions that affect future template instantiations. Thus we have a flag that declares (but does not yet define) a friend function:

template <int N>
struct flag
{
  friend constexpr int adl_flag (flag<N>);
};

And then a recursive reader function template that uses ADL and SFINAE to match against the (as-yet-undefined) friend function(s), bottoming out at zero:

template <int N, int = adl_flag(flag<N>{})>
constexpr int reader(int, flag<N>)
{
  return N;
}
 
template <int N>
constexpr int reader(
    float, flag<N>, int R = reader(0, flag<N-1>{}))
{
  return R;
}
 
constexpr int reader(float, flag<0>)
{
  return 0;
}

And finally, a writer that, when instantiated with the result of calling reader, instantiates the friend function, making the next call to reader terminate at one level higher:

template <int N>
struct writer
{
  friend constexpr int adl_flag (flag<N>)
  {
    return N;
  }
  static constexpr int value = N;
};

This is a curiosity, right? A foible of C++, a fairy tale told by wizened programmers to fresh graduates to simultaneously impress and revolt them, no? Could anything useful be done with this? Well, C++ is full of such tales, and it’s a short hop from can’t-look-away-revolting to established feature – after all, template metaprogramming was discovered pretty much by accident…

In fact, while at CppCon, I met up with Ansel and Barbara from CopperSpice, who are using a very similar technique to do away with the Qt Metaobject Compiler.

Max recursion depth, we meet again

My first thought was that this technique suffers at the hands of my old enemy, maximum recursion depth. In this case, maximum template instantiation depth which, despite a standard-recommended 1024, is frequently just 256 – lower than the recommended constexpr recursion depth of 512. So let’s do something about that.

Well, one quick-and-dirty way to do this is to compute the count in two halves: lower bits and upper bits, and then stick them together. When we reach the max on the lower bits, we’ll roll over one of the upper bits. So we have two flags representing the high and low, with the low flag also parameterized on the high bits:

template <int H, int L>
struct flag1
{
  friend constexpr int adl_flag1(flag1<H, L>);
};
template <int H>
struct flag2
{
  friend constexpr int adl_flag2(flag2<H>);
};

And two readers: the low bits reader is in a struct to avoid partial specialization of a function, because it’s effectively parameterized on the high bits as well as the low bits.

template <int H>
struct r1
{
  template <int L, int = adl_flag1(flag1<H, L>{})>
  static constexpr int reader(int, flag1<H, L>)
  {
    return L;
  }
  template <int L>
  static constexpr int reader(
      float, flag1<H, L>, int R = reader(0, flag1<H, L-1>{}))
  {
    return R;
  }
  static constexpr int reader(float, flag1<H, 0>)
  {
    return 0;
  }
};
 
template <int H, int = adl_flag2(flag2<H>{})>
constexpr int reader(int, flag2<H>)
{
  return H;
}
template <int H>
constexpr int reader(
    float, flag2<H>, int R = reader(0, flag2<H-1>{}))
{
  return R;
}
constexpr int reader(float, flag2<0>)
{
  return 0;
}

The low bits writer looks much the same as before, and the high bits writer is specialized on a bool indicating whether or not to instantiate the friend function, which we only do when the low bits roll over:

template <int H, bool B>
struct writehi
{
  friend constexpr int adl_flag2(flag2<H>)
  {
    return H;
  }
  static constexpr int value = H;
};
 
template <int H>
struct writehi<H, false>
{
  static constexpr int value = H;
};

The writer can then write both the high and low bits accordingly:

template <int H, int L>
struct writer
{
  static constexpr int hi_value =
    writehi<H+1, L == MAX>::value;
  static constexpr int lo_value =
    writelo<H, (L & BIT_MASK)>::value;
  static constexpr int value = (H << BIT_DEPTH) + L;
};
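
(For completeness: writelo and the split constants are elided above. They would look something like the following sketch – the exact BIT_DEPTH here is an assumption, not taken from the post.)

// assumed supporting definitions for the high/low split
constexpr int BIT_DEPTH = 7;        // number of low bits
constexpr int MAX = 1 << BIT_DEPTH; // low count at which we carry
constexpr int BIT_MASK = MAX - 1;

// the low-bits writer mirrors the original single-level writer
template <int H, int L>
struct writelo
{
  friend constexpr int adl_flag1(flag1<H, L>) { return L; }
  static constexpr int value = L;
};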

Using this approach we can easily increase the maximum number that we can get out of our counter from 256 to 16k or so, which is enough for one translation unit, for me.

Random acts of compiler abuse

Now, a counter is OK, as far as that goes, but something more useful might be nice… how about random numbers? But surely only a madman would try to implement a Mersenne Twister in C++11 constexpr-land. (This despite the fact that I did SHA256 string hashing.) No, these days when I think of random numbers, I think of Melissa O’Neill and her excellent PCG32. If you haven’t seen her video, go watch it. I’ll wait here. PCG32 has some distinct advantages for constexpr implementation:

  • It’s easy to implement
  • It’s fast
  • It’s not a lot of code (seriously, ~10 lines)
  • It’s easy to implement
  • It’s understandable
  • It’s easy to implement

Here’s a simple implementation of the whole 32-bit affair:

constexpr uint64_t pcg32_advance(uint64_t s)
{
  return s * 6364136223846793005ULL
    + (1442695040888963407ULL | 1);
}
 
constexpr uint64_t pcg32_advance(uint64_t s, int n)
{
  return n == 0 ? s : pcg32_advance(pcg32_advance(s), n-1);
}
 
constexpr uint32_t pcg32_xorshift(uint64_t s)
{
  return ((s >> 18u) ^ s) >> 27u;
}
 
constexpr uint32_t pcg32_rot(uint64_t s)
{
  return s >> 59u;
}
 
constexpr uint32_t pcg32_output(uint64_t s)
{
  return (pcg32_xorshift(s) >> pcg32_rot(s))
    | (pcg32_xorshift(s) << ((-pcg32_rot(s)) & 31));
}

And now we can use exactly the same pattern as the constexpr counter, except the writer, instead of giving us an integer, will give us a random number (and we also plumb through the random seed S as a template parameter):

template <uint64_t S, int H, int L>
struct writer
{
  static constexpr int hi_value =
    writehi<S, H+1, L == MAX>::value;
  static constexpr int lo_value =
    writelo<S, H, (L & BIT_MASK)>::value;
  static constexpr uint32_t value =
    pcg32_output(pcg32_advance(S, (H << BIT_DEPTH) + L));
};

There is probably a way to improve this by storing the actual PCG-computed value in a template instantiation, rather than simply using the integer to pump the PCG every time as I am doing. But it’s a proof-of-concept and works well enough for now. A simple macro will give us a suitable seed for our compile-time RNG by hashing a few things together:

#define cx_pcg32 \
  cx::pcg32<cx::fnv1(__FILE__ __DATE__ __TIME__) + __LINE__>

And now we have a compile-time RNG that is seeded differently every compile, every file, every line. There are potentially a lot of template instantiations – maybe using a lot of memory in the compiler – but we can do some useful things with this.

More string hashing with C++11 constexpr

Wednesday, October 14th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

So FNV1 was easy, and Murmur3 wasn’t too much harder; for a challenge and to see how far I could go, I decided to try to compute an MD5 string hash using C++11 constexpr.

This was significantly harder. I broke out my copy of Applied Cryptography 2e, found a reference implementation of MD5 in C, read through RFC 1321 and the pseudocode on the Wikipedia page.

Few, few the bird make her nest

I built up MD5 piece by piece, pulling out parts of the reference implementation to check that I’d got each building block right before moving on. The actual round function primitives were the easy part. As usual for a hash function, they are a mixture of bitwise functions, shifts, rotates, adds. These types of things make for trivial constexpr functions, for example:

constexpr uint32_t F(uint32_t X, uint32_t Y, uint32_t Z)
{
  return (X & Y) | (~X & Z);
}
 
constexpr uint32_t rotateL(uint32_t x, int n)
{
  return (x << n) | (x >> (32-n));
}
 
constexpr uint32_t FF(uint32_t a, uint32_t b,
                      uint32_t c, uint32_t d,
                      uint32_t x, int s, uint32_t ac)
{
  return rotateL(a + F(b,c,d) + x + ac, s) + b;
}

There are similar functions for the other low-level primitives of the MD5 round functions, conventionally called F, G, H and I.
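
For reference, the other three primitives from RFC 1321 translate just as directly:

constexpr uint32_t G(uint32_t X, uint32_t Y, uint32_t Z)
{
  return (X & Z) | (Y & ~Z);
}

constexpr uint32_t H(uint32_t X, uint32_t Y, uint32_t Z)
{
  return X ^ Y ^ Z;
}

constexpr uint32_t I(uint32_t X, uint32_t Y, uint32_t Z)
{
  return Y ^ (X | ~Z);
}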

Now, MD5 works on buffer chunks of 512 bits, or 16 32-bit words. So assuming we have a string long enough, it’s easy, if a bit long-winded, to convert a block of string into a schedule that MD5 can work on:

struct schedule
{
  uint32_t w[16];
};
 
constexpr schedule init(const char* buf)
{
  return { { word32le(buf), word32le(buf+4), word32le(buf+8), word32le(buf+12),
        word32le(buf+16), word32le(buf+20), word32le(buf+24), word32le(buf+28),
        word32le(buf+32), word32le(buf+36), word32le(buf+40), word32le(buf+44),
        word32le(buf+48), word32le(buf+52), word32le(buf+56), word32le(buf+60) } };
}
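
(word32le is a little helper not shown in the post; a sketch of it would assemble a little-endian 32-bit word from four bytes like so:)

constexpr uint32_t word32le(const char* s)
{
  return static_cast<uint32_t>(static_cast<unsigned char>(s[0]))
       | (static_cast<uint32_t>(static_cast<unsigned char>(s[1])) << 8)
       | (static_cast<uint32_t>(static_cast<unsigned char>(s[2])) << 16)
       | (static_cast<uint32_t>(static_cast<unsigned char>(s[3])) << 24);
}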

Seconds out, round one

It’s not pretty, but such are the constructs that may arise when you only have C++11 constexpr to work with. The output of MD5 is going to be 4 32-bit words of hash (denoted A, B, C and D in the literature), and in the main loop, which happens for each 512 bits of the message, there are four rounds, each round having 16 steps. After each step, the 4 words are rotated so that A becomes the new B, B becomes the new C, etc. So a round step is fairly easy – here’s the round 1 step which uses the F primitive:

struct md5sum
{
  uint32_t h[4];
};
 
constexpr md5sum round1step(const md5sum& sum,
                            const uint32_t* block,
                            int step)
{
  return { {
      FF(sum.h[0], sum.h[1], sum.h[2], sum.h[3],
         block[step], r1shift[step&3], r1const[step]),
        sum.h[1], sum.h[2], sum.h[3]
        } };
}

As you can see, there are some constants (r1shift and r1const) for each stage of round 1.
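These are the standard round 1 values from RFC 1321 – the per-step rotation amounts cycle through 7, 12, 17 and 22, and the constants are the first 16 of the sine-derived T values. A sketch of how they might be declared (the repo may well organize them differently):

constexpr int r1shift[4] = { 7, 12, 17, 22 };

constexpr uint32_t r1const[16] = {
  0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee,
  0xf57c0faf, 0x4787c62a, 0xa8304613, 0xfd469501,
  0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be,
  0x6b901122, 0xfd987193, 0xa679438e, 0x49b40821
};

Rotating the words after each round step is also easy: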

constexpr md5sum rotateCR(const md5sum& sum)
{
  return { { sum.h[3], sum.h[0], sum.h[1], sum.h[2] } };
}
 
constexpr md5sum rotateCL(const md5sum& sum)
{
  return { { sum.h[1], sum.h[2], sum.h[3], sum.h[0] } };
}

So now we are able to put together a complete round, which recurses, calling the round step function and rotating the output until we’re done after 16 steps.

constexpr md5sum round1(const md5sum& sum,
                        const uint32_t* msg, int n)
{
  return n == 16 ? sum :
    rotateCL(round1(rotateCR(round1step(sum, msg, n)), msg, n+1));
}

Rounds 2 through 4 are very similar, but instead of using the F primitive, they use G, H and I respectively, each with its own shift amounts, constants and message-word ordering.
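For instance, a round 2 step might look like the following sketch, assuming r2shift and r2const arrays analogous to round 1’s (the round 2 shifts cycle through 5, 9, 14 and 20, and round 2 reads the message words in the order (1 + 5*step) mod 16):

// identical in shape to FF, but built on the G primitive
constexpr uint32_t GG(uint32_t a, uint32_t b,
                      uint32_t c, uint32_t d,
                      uint32_t x, int s, uint32_t ac)
{
  return rotateL(a + G(b,c,d) + x + ac, s) + b;
}

constexpr md5sum round2step(const md5sum& sum,
                            const uint32_t* block,
                            int step)
{
  return { {
      GG(sum.h[0], sum.h[1], sum.h[2], sum.h[3],
         block[(1 + step*5) & 15], r2shift[step&3], r2const[step]),
        sum.h[1], sum.h[2], sum.h[3]
        } };
}

A complete MD5 transform for one 512-bit block looks like this (with a helper function that sums the MD5 result parts):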

constexpr md5sum sumadd(const md5sum& s1, const md5sum& s2)
{
  return { { s1.h[0] + s2.h[0], s1.h[1] + s2.h[1],
        s1.h[2] + s2.h[2], s1.h[3] + s2.h[3] } };
}
 
constexpr md5sum md5transform(const md5sum& sum,
                              const schedule& s)
{
  return sumadd(sum,
                round4(
                    round3(
                        round2(
                            round1(sum, s.w, 0),
                            s.w, 0),
                        s.w, 0),
                    s.w, 0));
}

The ugly business of padding

So far so good. This works, as long as we’re processing the complete 512-bit blocks contained in the message. Now to consider how to finish off. The padding scheme for MD5 is as follows:

  • Append a 1-bit (this is always done, even if the message is a multiple of 512 bits)
  • Add as many 0-bits as needed to bring the final block up to 448 bits (56 bytes)
  • Append a 64-bit value of the original length in bits to the 448 bits to make a final 512-bit block
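Taken together, the byte at index i of the final padded block can be computed like this – a minimal sketch of my own for the case where fewer than 56 leftover bytes remain, not the leftover function from the repo:

constexpr char padded_byte(const char* msg, int len, int origlen, int i)
{
  return i < len ? msg[i] :                    // original message bytes
    i == len ? static_cast<char>(0x80) :       // the appended 1-bit
    i < 56 ? static_cast<char>(0) :            // 0-bits up to 448 bits
    static_cast<char>(                         // length in bits, little-endian
        (static_cast<uint64_t>(origlen) * 8 >> ((i - 56) * 8)) & 0xff);
}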

The general case gets messy in C++11 constexpr-land, but suffice it to say that I wrote a leftover function, analogous to init, that could deal with padding. Now, finally, the complete MD5 calculation, which has three conditions:

  1. As long as there is a complete 64-byte (512-bit) block to work on, recurse on that.
  2. If the leftover is 56 bytes or more, there is no room left for the length: pad without it and recurse on an “empty block”.
  3. Otherwise, pad with the padding bits and the length, do a final transform, and we’re done!

constexpr md5sum md5update(const md5sum& sum, const char* msg,
                            int len, int origlen)
{
  return
    len >= 64 ?
    md5update(md5transform(sum, init(msg)), msg+64, len-64, origlen) :
    len >= 56 ?
    md5update(md5transform(sum,
                           leftover(msg, len, origlen, 64)),
              msg+len, -1, origlen) :
    md5transform(sum, leftover(msg, len, origlen, 56));
}

Woot!

I don’t mind admitting that when I finally got this working, I did a happy dance around my apartment, even though the code has its ugly parts. But of course the thrill of achievement soon gives way to the thirst for more, and MD5 isn’t exactly today’s choice of hash algorithm, even if it is still often used where cryptographic strength isn’t paramount. So I started thinking about SHA256.

Of course, cryptography proceeds largely by incrementally twiddling algorithms when they are found lacking: adding rounds, beefing up functions, etc. And so the NSA’s SHA series proceeded from Ron Rivest’s MD series. In particular, the block size and padding schemes are identical. That was a huge leg-up on SHA256, since I’d already solved the hardest part and I could reuse it.

SHA256 produces a larger digest and obviously has different magic numbers going into the rounds, but structurally it’s very similar to MD5. Again I used Wikipedia’s pseudocode as a reference. The only real new work compared with MD5, besides changing up some trivial maths, was the schedule-extend step. SHA256 copies the 16 words of the message into a 64-word block, and extends those 16 words into the remaining 48. Computing a single value in the schedule, w[i], is easily written in a recursive style, where n represents the number of “real” values we already have, initially 16:

constexpr uint32_t extendvalue(const uint32_t* w, int i, int n)
{
  return i < n ? w[i] :
    extendvalue(w, i-16, n) + extendvalue(w, i-7, n)
    + s0(extendvalue(w, i-15, n)) + s1(extendvalue(w, i-2, n));
}
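The s0 and s1 helpers here are SHA256’s “small sigma” functions. They might look something like this, with a rotateR analogous to the earlier rotateL (the rotation and shift amounts are the standard ones from the SHA256 specification):

constexpr uint32_t rotateR(uint32_t x, int n)
{
  return (x >> n) | (x << (32-n));
}

constexpr uint32_t s0(uint32_t x)
{
  return rotateR(x, 7) ^ rotateR(x, 18) ^ (x >> 3);
}

constexpr uint32_t s1(uint32_t x)
{
  return rotateR(x, 17) ^ rotateR(x, 19) ^ (x >> 10);
}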

And to extend all the values, we can then do:

constexpr schedule sha256extend(const schedule& s)
{
  return { { s.w[0], s.w[1], s.w[2], s.w[3],
        s.w[4], s.w[5], s.w[6], s.w[7],
        s.w[8], s.w[9], s.w[10], s.w[11],
        s.w[12], s.w[13], s.w[14], s.w[15],
        extendvalue(s.w, 16, 16), extendvalue(s.w, 17, 16),
        //
        // and so on...
        //
        extendvalue(s.w, 62, 16), extendvalue(s.w, 63, 16) } };
}

When I did this initially, I ran into the compiler’s max step limit for a constexpr computation. Different from the max recursion limit, the max step limit is a guideline for how many expressions should be evaluated within a single constant expression – the standard’s Appendix B suggests 2^20. If a compiler did memoization on extendvalue I suspect this would be fine, but evidently clang doesn’t, so as a compromise I split the function into 3: sha256extend16, sha256extend32 and sha256extend48, each of which extends the schedule by 16 words at a time. And that worked.

After extending the schedule and changing some of the maths functions, the rest was easy – practically the same as for MD5. Now I was done with string hashing.

For the next experiments with compile-time computation, I wanted to understand a strange thing I’d seen online…

C++11 compile-time string hashing

Tuesday, October 13th, 2015

(Start at the beginning of the series – and all the source can be found in my github repo)

Now that I was used to writing C++11-style constexpr, I decided to try some compile-time string hashing. It turns out that FNV1 is very easy to express in a move-down-the-string recursive style:

namespace detail
{
  constexpr uint64_t fnv1(uint64_t h, const char* s)
  {
    return (*s == 0) ? h :
      fnv1((h * 1099511628211ull) ^
           static_cast<uint64_t>(*s), s+1);
  }
}
constexpr uint64_t fnv1(const char* s)
{
  return true ?
    detail::fnv1(14695981039346656037ull, s) :
    throw err::fnv1_runtime_error;
}
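As a quick sanity check that the hash really is computed at compile time: hashing the empty string never enters the recursion, so the result must be the FNV1 offset basis.

static_assert(fnv1("") == 14695981039346656037ull,
              "FNV1 of the empty string is the offset basis");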

That wasn’t difficult enough

So far so good. How about some other hash functions? On a previous project I had used MurmurHash, so I decided to give that a go.

(Coincidentally, a blog post has recently surfaced implementing 32-bit Murmur3 using C++14 constexpr. But I was sticking to C++11 constexpr.)

Implementing such a big function from cold in C++11 constexpr was bound to be difficult, so I scaffolded. I used the freely-available runtime implementation for testing, and initially converted it to C++14 constexpr, which was fairly easy. After doing that, and with Wikipedia’s pseudocode as a guide, I was ready to start breaking down (or building up) the functionality, C++11-style. At the base of 32-bit Murmur3 is the hash round function, which is applied to each 4-byte chunk of the key:

constexpr uint32_t murmur3_32_k(uint32_t k)
{
  return (((k * 0xcc9e2d51) << 15)
          | ((k * 0xcc9e2d51) >> 17)) * 0x1b873593;
}
 
constexpr uint32_t murmur3_32_hashround(
    uint32_t k, uint32_t hash)
{
  return (((hash^k) << 13)
          | ((hash^k) >> 19)) * 5 + 0xe6546b64;
}

And this can easily be put into a loop, with a helper function to assemble 4-byte chunks (the somewhat strange formulation of word32le here is because it is used differently by some other code):

constexpr uint32_t word32le(const char* s, int len)
{
  return
    (len > 0 ? static_cast<uint32_t>(s[0]) : 0)
    + (len > 1 ? (static_cast<uint32_t>(s[1]) << 8) : 0)
    + (len > 2 ? (static_cast<uint32_t>(s[2]) << 16) : 0)
    + (len > 3 ? (static_cast<uint32_t>(s[3]) << 24) : 0);
}
 
constexpr uint32_t word32le(const char* s)
{
  return word32le(s, 4);
}
 
constexpr uint32_t murmur3_32_loop(
    const char* key, int len, uint32_t hash)
{
  return len == 0 ? hash :
    murmur3_32_loop(
        key + 4,
        len - 1,
        murmur3_32_hashround(
            murmur3_32_k(word32le(key)), hash));
}

So this is the first part of Murmur3 that will process all the 4-byte chunks of key. What remains is to deal with the trailing bytes (up to 3 of them) and then finalize the hash. To deal with the trailing bytes, I chain the end functions (_end0 through _end3) and “jump in” at the right place. There is probably a more elegant way to do this, but…

constexpr uint32_t murmur3_32_end0(uint32_t k)
{
  return (((k*0xcc9e2d51) << 15)
          | ((k*0xcc9e2d51) >> 17)) * 0x1b873593;
}
 
constexpr uint32_t murmur3_32_end1(uint32_t k,
                                   const char* key)
{
  return murmur3_32_end0(
      k ^ static_cast<uint32_t>(key[0]));
}
 
constexpr uint32_t murmur3_32_end2(uint32_t k,
                                   const char* key)
{
  return murmur3_32_end1(
      k ^ (static_cast<uint32_t>(key[1]) << 8), key);
}
 
constexpr uint32_t murmur3_32_end3(uint32_t k,
                                   const char* key)
{
  return murmur3_32_end2(
      k ^ (static_cast<uint32_t>(key[2]) << 16), key);
}
 
constexpr uint32_t murmur3_32_end(uint32_t hash,
                                  const char* key, int rem)
{
  return rem == 0 ? hash :
    hash ^ (rem == 3 ? murmur3_32_end3(0, key) :
            rem == 2 ? murmur3_32_end2(0, key) :
            murmur3_32_end1(0, key));
}

Finalizing the hash is a very similar affair, with 3 stages:

constexpr uint32_t murmur3_32_final1(uint32_t hash)
{
  return (hash ^ (hash >> 16)) * 0x85ebca6b;
}
constexpr uint32_t murmur3_32_final2(uint32_t hash)
{
  return (hash ^ (hash >> 13)) * 0xc2b2ae35;
}
constexpr uint32_t murmur3_32_final3(uint32_t hash)
{
  return (hash ^ (hash >> 16));
}
 
constexpr uint32_t murmur3_32_final(uint32_t hash, int len)
{
  return
    murmur3_32_final3(
        murmur3_32_final2(
            murmur3_32_final1(hash ^ static_cast<uint32_t>(len))));
}

And that’s all there is to it: all that remains is to stitch the pieces together and provide the driver function:

constexpr uint32_t murmur3_32_value(const char* key, int len,
                                    uint32_t seed)
{
  return murmur3_32_final(
      murmur3_32_end(
          murmur3_32_loop(key, len/4, seed),
          key+(len/4)*4, len&3),
      len);
}
 
constexpr uint32_t murmur3_32(const char *key, uint32_t seed)
{
  return true ?
    murmur3_32_value(key, strlen(key), seed) :
    throw err::murmur3_32_runtime_error;
}

Doing strlen non-naively

One more thing: strlen crept in there. This is of course a constexpr version of strlen. The naive way to do this is to step down the string linearly, as seen in the FNV1 algorithm, but with a max recursion depth of 512, that seemed fairly limiting – it’s not uncommon to have 1k string literals that you might want to hash.

So the way I implemented strlen was to measure the string in chunks, constraining the recursion to a max depth of, say, 256. Instead of a plain linear recursion, it recurses once for every 256 bytes of string length, and then recurses linearly within the current chunk. The max depth is thus 256 + 256 = 512, and we can deal with strings close to 64k in size (depending on how deep we already are when calling strlen). There’s a chance that strlen might be made constexpr in the future – I think it’s already implemented as an intrinsic in some compilers – so maybe I can throw away this code someday.
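A minimal sketch of the idea, assuming a chunk size of 256 (the version in the repo may differ in details):

namespace detail
{
  // scan linearly within one chunk, stopping at the chunk boundary;
  // recursion depth here is at most 256
  constexpr int strlen_chunk(const char* s, int n)
  {
    return n == 256 ? 256 :
      *s == 0 ? n :
      strlen_chunk(s + 1, n + 1);
  }
}

// recurse once per full 256-byte chunk: for a string of length L the
// total depth is roughly L/256 + 256
constexpr int cx_strlen(const char* s)
{
  return detail::strlen_chunk(s, 0) < 256 ?
    detail::strlen_chunk(s, 0) :
    256 + cx_strlen(s + 256);
}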

Anyway, now we have a complete implementation of 32-bit Murmur3 in C++11 constexpr style, and the tests go something like this (and can be checked online):

static_assert(murmur3_32("hello, world", 0) == 345750399,
              "murmur3 test 1");
static_assert(murmur3_32("hello, world1", 0) == 3714214180,
              "murmur3 test 2");
static_assert(murmur3_32("hello, world12", 0) == 83041023,
              "murmur3 test 3");
static_assert(murmur3_32("hello, world123", 0) == 209220029,
              "murmur3 test 4");
static_assert(murmur3_32("hello, world1234", 0) == 4241062699,
              "murmur3 test 5");
static_assert(murmur3_32("hello, world", 1) == 1868346089,
              "murmur3 test 6");

OK. That was a lot of code for a blog post. But after the trivial FNV1 and the only slightly harder Murmur3, I wanted more of a challenge. What about MD5? Or SHA-256? That’s for the next post.