One of the things I like to do in my spare time is study the STL algorithms. It is easy to take them for granted and easy, perhaps, to imagine that they are mostly trivial. And some are: I would think that any decent interview candidate ought to be able to write `find`

correctly. (Although even such a trivial-looking algorithm is based on a thorough understanding of iterator categories.)

**Uncovering the overlooked**

But there are some algorithms that are non-trivial, and some are important building blocks. Take `inplace_merge`

. For brevity, let’s consider the version that just uses `operator<`

rather than being parametrized on the comparison. The one easily generalizes to the other in a way that is not important to the algorithm itself.

`template `
void inplace_merge(BidirectionalIterator first,
BidirectionalIterator middle,
BidirectionalIterator last);

It merges two consecutive sorted ranges into one sorted range. That is, if we have an input like this:

[latex display="true"]x_0 \: x_1 \: x_2 \: ... \: x_n \: y_0 \: y_1 \: y_2 \: ... \: y_m[/latex]

Where [latex]\forall i < n, x_i \leq x_{i+1}[/latex] and [latex]\forall j < m, y_j \leq y_{j+1}[/latex]

We get this result occupying the same space:

[latex display="true"]r_0 \: r_1 \: r_2 \: ... \: r_{n+m}[/latex]

Where [latex]\forall i < n+m, r_i \leq r_{i+1}[/latex] and the new range is a permutation of the original ranges. In addition, the standard states a few additional constraints:

`inplace_merge`

is stable - that is, the relative order of equivalent elements is preserved- it uses a
`BidirectionalIterator`

which shall be`ValueSwappable`

and whose dereferent (is that a word?) shall be`MoveConstructible`

and`MoveAssignable`

- when enough additional memory is available, (
`last-first-1`

) comparisons. Otherwise an algorithm with complexity`N log(N)`

(where`N = last-first`

) may be used

**Avenues of enquiry**

Leaving aside the possible surprise of discovering that an STL algorithm may allocate memory, some thoughts spring to mind immediately:

- Why does
`inplace_merge`

need a`BidirectionalIterator`

? *How much*memory is required to achieve O(n) performance? Is a constant amount enough?

And to a lesser extent perhaps:

- Why are
`merge`

and`inplace_merge`

not named the same way as other algorithms, where the more normal nomenclature might be`merge_copy`

and`merge`

? - What is it with the algorists' weasel-word "in-place"?

**First thoughts about the algorithm**

It seems that an O(n log n) algorithm should be possible on average, because in the general case, simply sorting the entire range produces the desired output. Although the sort has to be stable, which means something like merge sort, which leads us down a recursive rabbit hole. Hm.

At any rate, it's easy to see how to achieve `inplace_merge`

with no extra memory needed by walking iterators through the ranges:

`template `
void naive_inplace_merge(
ForwardIt first, ForwardIt middle, ForwardIt last)
{
while (first != middle && middle != last) {
if (*middle < *first) {
std::iter_swap(middle, first);
auto i = middle;
std::rotate(++first, i, ++middle);
} else {
++first;
}
}
}

After swapping (say) [latex]x_0[/latex] and [latex]y_0[/latex], the ranges look like this:

[latex display="true"]y_0 \: x_1 \: x_2 \: ... \: x_n \: x_0 \: y_1 \: y_2 \: ... \: y_m[/latex]

And the call to `rotate`

fixes up [latex]x_1 \: ... \: x_n \: x_0[/latex] to be ordered again. From there we proceed as before on the ranges [latex]x_0 \: ... \: x_n[/latex] and [latex]y_1 \: ... \: y_m[/latex].

This algorithm actually conforms to the standard! It has O(n) comparisons, uses no extra memory, and has the advantage that it works on `ForwardIterator`

! But unfortunately, it's O(nÂ²) overall in operations, because of course, `rotate`

is O(n). So how can we do better?

**Using a temporary buffer**

If we have a temporary buffer available that is equal in size to the smaller of the two ranges, we can move the smaller range to it, move the other range up if necessary, and perform a "normal" merge of the two ranges into the original space:

`template `
void naive_inplace_merge2(
BidirIt first, BidirIt middle, BidirIt last)
{
using T = typename std::iterator_traits::value_type;
auto d1 = std::distance(first, middle);
auto d2 = std::distance(middle, last);
auto n = std::min(d1, d2);
auto tmp = std::make_unique(n * sizeof(T));
T* begint = reinterpret_cast(tmp.get());
T* endt = begint + n;
if (d1 <= d2)
{
std::move(first, middle, begint);
std::merge(begint, endt, middle, last, first);
}
else
{
std::move(middle, last, begint);
auto i = std::move_backward(first, middle, last);
std::merge(i, last, begint, endt, first);
}
}

This is essentially the algorithm used by STL implementations if buffer space is available. And this is the reason why `inplace_merge`

requires `BidirectionalIterator`

: because `move_backward`

does.

(This isn't quite optimal: the `std::move_backward`

can be mitigated with reverse iterators and predicate negation, but the `BidirectionalIterator`

requirement remains. Also, strictly speaking, `std::merge`

is undefined behaviour here because one of the input ranges overlaps the output range, but we know the equivalent loop is algorithmically safe.)

Provisioning of the temporary buffer is also a little involved because we don't know that elements in the range are default constructible (and perhaps we wouldn't want to default-construct our temporaries anyway). So to deal correctly with non-trivial types here, `std::move`

should actually be a loop move-constructing values. And when `std::inplace_merge`

is used as a building block for e.g. `std::stable_sort`

, it would also be nice to minimize buffer allocation rather than having an allocation per call. Go look at your favourite STL implementation for more details.

**Thinking further**

The literature contains more advanced algorithms for merging if a suitably-sized buffer is not available: the basis for the STL's choice is covered in Elements of Programming chapter 11, and in general the work of Dudzinski & Dydek and of Kim & Kutzner seems to be cited a lot.

But I knew nothing of this research before tackling the problem, and attempting to solve it requiring just `ForwardIterator`

.

I spent a couple of evenings playing with how to do `inplace_merge`

. I covered a dozen or more A4 sheets of squared paper with diagrams of algorithms evolving. I highly recommend this approach! After a few hours of drawing and hacking I had a really good idea of the shape of things. Property-based testing came in very handy for breaking my attempts, and eventually led me to believe that a general solution on the lines I was pursuing would either involve keeping track of extra iterators or equivalently require extra space. Keeping track of iterators seemed a messy approach, so an extra space approach is warranted.

How much extra space? Consider the "worst case":

[latex display="true"]x_0 \: x_1 \: x_2 \: ... \: x_n \: y_0 \: y_1 \: y_2 \: ... \: y_m[/latex]

Assume for the moment that [latex]m \leq n[/latex]. When [latex]y_m < x_0[/latex], we need extra space to hold all of [latex]x_0 \: ... \: x_m[/latex]. If [latex]n \leq m[/latex] then we will need extra space for [latex]x_0 \: ... \: x_n[/latex] to likewise move them out of the way. Either way, the number of units of extra space we need is `min(n, m)`

.

As we move elements of [latex]x[/latex] into temporary storage, we can see that in general at each stage of the algorithm we will have a situation something like this (using [latex]Z[/latex] to mean a moved-from value):

[latex display="true"]... \: x_i \: ... \: x_n \: Z_0 \: ... \: Z_{j-1} \: y_j \: ... \: y_m[/latex]

With some values of [latex]x[/latex] moved into temporary storage:

[latex display="true"]x_{i-t} \: ... \: x_{i-1}[/latex]

The temporary storage here is a queue: we always push on to the end and pop from the beginning, since the elements in it start, and remain, ordered. Since we know an upper bound on the number of things in the queue at any one time, it can be a ring buffer (recently proposed) over a regular array of values.

**Sketching the start**

From this, we can start sketching out an algorithm:

- Allocate a buffer of size
`min(m, n)`

- call it`tmp`

- We'll walk the iterators along the
`x`

(`first`

) and`y`

(`middle`

) ranges - The output iterator
`o`

will start at`first`

- The next
`x`

to consider will either be in-place in the`x`

range, or (if`tmp`

is not empty) in`tmp`

- call it`xlow`

- If
`*y`

<`*xlow`

move`*x`

to`tmp`

, move`*y`

to`o`

, inc`y`

- Otherwise, if
`*xlow`

is in`tmp`

, move`*x`

to`tmp`

and`*xlow`

from`tmp`

to`o`

- inc
`o`

, inc`x`

- if
`y`

<`last`

and`o`

<`middle`

, goto 4 - deal with the end(s) of the ranges

**Dealing with the end**

This gets us as far as exhausting the smaller range: after this, we will be in one of two situations.

*Situation 1*. If we have exhausted the [latex]y[/latex] range, things look like this:

[latex display="true"]... \: x_i \: ... \: x_n \: Z_0 \: ... \: Z_m[/latex]

With values of [latex]x[/latex] in temporary storage:

[latex display="true"]x_{i-t} \: ... \: x_{i-1}[/latex]

To fix up this situation, we can repeatedly swap the `tmp`

range with the equivalent `x`

range until we reach `middle`

(i.e [latex]Z_0[/latex]), and then simply move the remaining `tmp`

values into place.

I originally wrote a loop repeatedly swapping the values in `tmp`

right up to the end; but I realised that would involve swapping a moved-from object, which would be wrong (it might work... until it doesn't). Moved-from objects should either be destroyed or made whole (assigned to); nothing else.

*Situation 2*. The possibility is that we have exhausted the [latex]x[/latex] range, in which case things look like this:

[latex display="true"]... \: Z_0 \: ... \: Z_{n-i} \: y_j \: ... \: y_m[/latex]

With values of [latex]x[/latex] in temporary storage:

[latex display="true"]x_i \: ... \: x_n[/latex]

To fix up this situation, we can just do a regular `merge`

on the remaining `y`

range and `tmp`

, outputting starting at middle (i.e [latex]Z_0[/latex]). (With the same proviso as before about undefined behaviour with overlapping ranges.) We know that it will be safe to do a loop equivalent to `merge`

, because we have exactly the required amount of space before [latex]y_j[/latex] to fit [latex]x_i \: ... \: x_n[/latex]. This is the same as the STL's normal buffered merge strategy.

**Final thoughts**

I tackled this exercise from scratch, knowing nothing about actual implementations of `inplace_merge`

. This algorithm does some extra housekeeping under the hood, but:

- it does the minimum number of comparisons
- each element is moved at most twice: into
`tmp`

and out again - it needs only
`ForwardIterator`

Optimization and benchmarking under differing conditions of range size, comparison and move cost is left as an exercise to the reader...

I cannot recommend Elements of Programming enough. I am partway through reading it; after this exercise I skipped to chapter 11 to see what it said. Every time I dive into the STL algorithms, I am re-impressed by the genius of Alex Stepanov: Paul McJones' recent talk The Concept of Concept explains this well, in particular the key role of concepts in the STL in service of algorithmic purity. Alex knew about concepts from the start: it's taken C++ over 2 decades to catch up.

After doing this work, I discovered a recent proposal that calls for weakening the iterator categories of `inplace_merge`

and related algorithms.

An implementation of this algorithm is on github. It's just a sketch, written for clarity rather than optimality. This has been a fun exercise.