A Benchmark in a Unit Test - Oh crap, it actually works

Lately I've been working towards 100%^[1] test coverage in RandN. This has been very helpful and has already found several small bugs across the library. One of the RNGs I'm testing is ThreadLocalRng, which RandN guarantees is thread-safe by use of a ThreadLocal - a wrapper around ChaCha that maintains exactly one instance per thread. Another test already verifies that a unique instance is created for each thread, but there could still be some sort of hidden dependency within ChaCha, so I want to test it even further, even though threading is notoriously difficult to test reliably.
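For anyone who hasn't used it, here's a rough sketch of the one-instance-per-thread behavior, assuming the ThreadLocal mentioned above is .NET's System.Threading.ThreadLocal<T>. This is not RandN's code or its actual test; FakeRng and PerThreadDemo are hypothetical stand-ins that only show each thread getting its own object.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

// Hypothetical stand-in for a per-thread generator; not part of RandN.
public sealed class FakeRng { }

public static class PerThreadDemo
{
    // ThreadLocal<T> runs the factory once per thread, on first access to Value.
    private static readonly ThreadLocal<FakeRng> Local =
        new ThreadLocal<FakeRng>(() => new FakeRng());

    // Starts threadCount threads and returns how many distinct instances they saw.
    public static Int32 CountDistinctInstances(Int32 threadCount)
    {
        var seen = new ConcurrentBag<FakeRng>();
        var threads = new Thread[threadCount];
        for (Int32 i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                // Reading Value twice on the same thread yields the same object.
                var first = Local.Value;
                var second = Local.Value;
                if (ReferenceEquals(first, second))
                    seen.Add(first);
            });
            threads[i].Start();
        }

        foreach (var thread in threads)
            thread.Join();

        // FakeRng doesn't override Equals, so Distinct compares by reference.
        return seen.Distinct().Count();
    }
}
```

Calling PerThreadDemo.CountDistinctInstances(4) should report four distinct instances, one per thread.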

Enter ClusterThread - a unit test running several threads concurrently (up to 8 right now), all generating random numbers using ThreadLocalRng. To ensure that the threads ran concurrently for a decent length of time, I had each thread generate 500,000,000 32-bit integers (about 2 GB of output); my laptop can run ChaCha at about 500 megabytes per second, so it should take around 4 seconds for this test to run. In reality, it took a couple minutes - what the heck? After a couple minutes of pondering, I realized that, duh, I run tests in Debug configuration, so there's no optimization. I dropped it down to 50,000, which completed in the expected amount of time.

But this got me thinking - what if I ran it on my much more powerful desktop? Or if I ran it in release mode? This would mean the test runs too fast and some threads might finish too quickly. Similarly, if it's run on .NET Framework (with no support for x86 intrinsics) or a much slower computer, it'll take way too long to complete. How can I adjust the test to the speed of the environment? Why, run a benchmark of course.

const Int32 benchmarkIterations = 4000;
var stopwatch = Stopwatch.StartNew();
for (Int32 i = 0; i < benchmarkIterations; i++)
    ThreadLocalRng.Instance.NextUInt32();
stopwatch.Stop();

The benchmark runs first, before any threads are spawned. It generates a fixed quantity of numbers, and times how long this takes. The final quantity of numbers generated is then scaled up so that it takes about one second to run the test, regardless of the strength of the hardware it's running on.

var targetTime = TimeSpan.FromMilliseconds(1000);
var multiplier = targetTime.TotalMilliseconds / stopwatch.Elapsed.TotalMilliseconds;
var iterations = (Int32)(benchmarkIterations * multiplier);
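The scaling step can be pulled out into a small helper, which makes the arithmetic easy to check by hand. This is only a sketch, not the code used in RandN's tests: IterationScaling and ScaleIterations are hypothetical names, and the guards against a zero elapsed time and against overflow are my own additions.

```csharp
using System;

public static class IterationScaling
{
    // Scales a benchmark's iteration count so the real run lasts roughly targetTime.
    public static Int32 ScaleIterations(Int32 benchmarkIterations, TimeSpan elapsed, TimeSpan targetTime)
    {
        // A very fast benchmark can report zero elapsed time; avoid dividing by
        // zero by treating it as at least one TimeSpan tick (100 ns).
        Double elapsedMs = Math.Max(elapsed.TotalMilliseconds, TimeSpan.FromTicks(1).TotalMilliseconds);
        Double multiplier = targetTime.TotalMilliseconds / elapsedMs;
        Double scaled = benchmarkIterations * multiplier;

        // Clamp so the cast to Int32 cannot overflow.
        return (Int32)Math.Min(scaled, Int32.MaxValue);
    }
}
```

For example, if the 4000 benchmark iterations took 2 ms, the multiplier for a one-second target is 1000 / 2 = 500, so each thread would generate 4000 * 500 = 2,000,000 numbers.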

Given that it depends on the behavior of threads and the length of time it takes to run, this test has turned out to be more of an integration test than a unit test, but I'm ok with that in this case. There are still a couple of things I want to play around with. I'll probably place a floor on the quantity to ensure we don't end up with poor testing on very weak hardware. I also want to reduce the target run time from one second to maybe a half second or quarter second, if I can verify that it's enough time for all the threads to spin up and run concurrently. Lastly, the test runs for a fixed number of threads (1, 2, 4, and 8 threads) - I'd be interested in scaling this up automatically on systems with more cores available.
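Two of those ideas, the floor and the automatic thread counts, might look something like the following. Everything here is hypothetical: TestTuning, ApplyFloor, ThreadCounts, and the 100,000 minimum are illustrative choices, not values taken from RandN.

```csharp
using System;
using System.Collections.Generic;

public static class TestTuning
{
    // Arbitrary illustrative minimum; the right value would need measuring.
    public const Int32 MinimumIterations = 100_000;

    // Never let slow hardware shrink the workload below the floor.
    public static Int32 ApplyFloor(Int32 scaledIterations) =>
        Math.Max(scaledIterations, MinimumIterations);

    // Yields 1, 2, 4, ... up to and including the largest power of two
    // that does not exceed processorCount.
    public static IEnumerable<Int32> ThreadCounts(Int32 processorCount)
    {
        // "count > 0" stops the loop if doubling ever overflows to a negative value.
        for (Int32 count = 1; count > 0 && count <= processorCount; count *= 2)
            yield return count;
    }
}
```

On a 12-core machine, ThreadCounts(12) yields 1, 2, 4, and 8. In xUnit, a sequence like ThreadCounts(Environment.ProcessorCount) could be wrapped into object[] rows and supplied through [MemberData] in place of the fixed [InlineData] values, though I haven't tried that here.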

Full code

using System;
using System.Collections.Concurrent;
using System.Threading;
using Xunit;

[Theory]
[InlineData(1)]
[InlineData(2)]
[InlineData(4)]
[InlineData(8)]
public void ClusterThread(Int32 threadCount)
{
    // In this test, we spawn a bunch of threads and try to get them to use their
    // ThreadLocalRng concurrently, after which each thread adds an item to the completed
    // bag. If the bag doesn't end up with the same number of items as threads spawned, we
    // know that something went wrong.

    // We first run a quick benchmark to get a rough idea of how fast the RNG runs.
    const Int32 benchmarkIterations = 4000;
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    for (Int32 i = 0; i < benchmarkIterations; i++)
        ThreadLocalRng.Instance.NextUInt32();
    stopwatch.Stop();

    // We want each thread to do about 1 second of work.
    TimeSpan targetTime = TimeSpan.FromMilliseconds(1000);
    Double multiplier = targetTime.TotalMilliseconds / stopwatch.Elapsed.TotalMilliseconds;
    Int32 iterations = (Int32)(benchmarkIterations * multiplier);

    var completed = new ConcurrentBag<Int32>();
    void DoWork()
    {
        var rng = ThreadLocalRng.Instance;
        for (Int32 i = 0; i < iterations; i++)
            rng.NextUInt32();

        completed.Add(iterations);
    }

    var threads = new Thread[threadCount];
    for (Int32 i = 0; i < threads.Length; i++)
    {
        threads[i] = new Thread(DoWork);
        threads[i].Start();
    }

    foreach (var thread in threads)
        thread.Join();

    Assert.Equal(threadCount, completed.Count);
}

[1] It's not actually possible to get 100% code coverage since RandN has specialized code paths depending on processor features, like AVX2, SSE2, and endianness.