Introducing Quartz: A Deterministic Time Testing Library for Go

I hate flaky tests.

We run our test suite repeatedly during development and on every commit, so if a test is flaky, developers quickly learn not to trust it. It’s too easy to just retry and assume everything is fine. This is the worst of both worlds: the test no longer signals that something needs to be fixed, and it drags out the review-merge cycle with reruns.

Across a modest-sized development team, even a test that flakes 1 in 100 will burn one of your devs every day on average.

Today we’re going to dive deep into one particular area that’s hard to get right: code that depends on time.

Repeatability and Determinism

If your code is supposed to do something at 8pm every day, you can’t just wait around until 8pm to test that it does the right thing. Similarly, if your code is supposed to do something after 30 minutes, you probably don’t want to wait 30 minutes to know if it’s working correctly.

One approach is to make the values “8pm” and “30 minutes” configurable, and choose them at test time. This can work, but it has drawbacks.

The first is repeatability: the tests will run at different times on each run, and at different times than they would in production. I’ve seen flakes related to daylight saving time and to whether the test happened to start at zero minutes past the hour.¹

The second is determinism. Compress 30 minutes into 30ms and your tests may no longer do things in the order you expect, since consumer operating systems don’t make hard guarantees about real-time processing. Writing tests this way is an explicit balancing act between fast and flake-free execution. It’s hard to get right, especially since development environments tend to have fast hardware and not a lot of resource contention, whereas CI runs on constrained and shared hardware.

Introducing Quartz

Today we are introducing Quartz, a mocking library in Go for testing code that depends on time.

By mocking out calls that query or depend on the real time, we can write unit tests that are repeatable, deterministic, and execute as quickly as the CPU allows. This is, in itself, not a new idea. In building Quartz, we took some inspiration from:

Quartz shares their high-level design: a Clock interface that closely resembles the functions in the time standard library, a "real" clock that passes through to the standard library in production, and a mock clock that gives precise control in testing.

For several reasons, writing tests that are repeatable, deterministic, and fast is a tall order when the code depends on time. The rest of this blog post discusses those reasons and why we found the existing libraries insufficient. If you’re ready to dive into unit testing with Quartz, head over to the repository now and look at the README and examples.

I also want to give a big thank you to the authors of the above libraries. Discussing the limitations of the libraries in the next sections is not meant to imply these libraries are bad or not useful. To the extent that Quartz surpasses these limitations, it is only by standing on their shoulders.

Preventing Test Flakes

The following example comes from the README from benbjohnson/clock:

mock := clock.NewMock()
count := 0

// Kick off a timer to increment every 1 mock second.
go func() {
    ticker := mock.Ticker(1 * time.Second)
    for {
        <-ticker.C
        count++
    }
}()
runtime.Gosched()

// Move the clock forward 10 seconds.
mock.Add(10 * time.Second)

// This prints 10.
fmt.Println(count)

The first race condition is fairly obvious: moving the clock forward 10 seconds may generate 10 ticks on the ticker.C channel, but there is no guarantee that count++ executes before fmt.Println(count).

The second race condition is more subtle, but runtime.Gosched() is the tell. Since the ticker is started on a separate goroutine, there is no guarantee that mock.Ticker() executes before mock.Add(). runtime.Gosched() is an attempt to make this happen, but it makes no hard promises. On a busy system, especially when running tests in parallel, this can flake: the clock advances 10 seconds first, then the ticker starts and never generates a tick.

Let's talk about how Quartz tackles these problems.

In our experience, an extremely common use case is creating a ticker and then running a two-arm select, with one arm handling ticks and the other handling context expiry:

t := time.NewTicker(duration)
defer t.Stop() // release the ticker's resources when the loop exits
for {
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-t.C:
        err := do()
        if err != nil {
            return err
        }
    }
}

In Quartz, we refactor this to be more compact and testing-friendly:

t := clock.TickerFunc(ctx, duration, do)
return t.Wait()

Because tick processing is wrapped in the function passed to TickerFunc (do() in this example), the mock Clock knows explicitly when processing of each tick has finished.

In Quartz, advancing the clock returns an object you can Wait() on to ensure that all triggered ticks and timers have finished. This solves the first race condition in the example.

(As an aside, we still support a traditional standard library-style Ticker. You may find it useful if you want to keep your code as close as possible to the standard library, or if you need to use the channel in a larger select block. In that case, you'll have to find some other mechanism to sync tick processing to your test code.)

To prevent race conditions related to the starting of the ticker, Quartz allows you to set "traps" for calls that access the clock.

func TestTicker(t *testing.T) {
    mClock := quartz.NewMock(t)
    trap := mClock.Trap().TickerFunc()
    defer trap.Close() // stop trapping at end
    go runMyTicker(mClock) // async calls TickerFunc()
    call := trap.Wait(context.Background()) // waits for a call and blocks its return
    call.Release() // allow the TickerFunc() call to return
    // optionally check the duration using call.Duration
    // Move the clock forward 1 tick
    mClock.Advance(time.Second).MustWait(context.Background())
    // assert results of the tick
}

Trapping and then releasing the call to TickerFunc() ensures the ticker is started at a deterministic time, so our calls to Advance() will have a predictable effect.

Take a look at TestExampleTickerFunc in example_test.go for a complete worked example.

Complex Time Dependence

Another case that is difficult to unit test arises when the code under test makes multiple calls that depend on the time, and you want to simulate some time passing between them.

A very basic example is measuring how long something took:

var measurement time.Duration
go func(clock quartz.Clock) {
    start := clock.Now()
    doSomething()
    measurement = clock.Since(start)
}(mClock)

// how to get measurement to be, say, 5 seconds?

The two calls into the clock happen asynchronously, so we need to be able to advance the clock after the first call to Now() but before the call to Since(). Doing this with the libraries we mentioned above means that you have to be able to mock out or otherwise block the completion of doSomething().

But with the trap functionality we mentioned in the previous section, you can deterministically control the time each call sees.

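trap := mClock.Trap().Since()
defer trap.Close() // stop trapping at the end of the test
var measurement time.Duration
done := make(chan struct{})
go func(clock quartz.Clock) {
    defer close(done)
    start := clock.Now()
    doSomething()
    measurement = clock.Since(start)
}(mClock)

c := trap.Wait(ctx) // blocks until Since() is called, so Now() has already returned
mClock.Advance(5*time.Second)
c.Release() // Since() returns, observing the advanced clock
<-done      // wait for the goroutine before reading measurement
// measurement is now exactly 5 seconds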

We wait until we trap the clock.Since() call, which implies that clock.Now() has completed, then advance the mock clock 5 seconds. Finally, we release the clock.Since() call. Any changes to the clock that happen before we release the call will be included in the time used for the clock.Since() call.

As a more involved example, consider an inactivity timeout: we want something to happen if there is no activity recorded for some period, say 10 minutes in the following example:

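type InactivityTimer struct {
    mu       sync.Mutex
    activity time.Time
    clock    quartz.Clock
}

func (i *InactivityTimer) Start() {
    i.mu.Lock()
    defer i.mu.Unlock()
    next := i.clock.Until(i.activity.Add(10*time.Minute))
    // Declare t first so the callback can refer to it when rescheduling.
    var t *quartz.Timer
    t = i.clock.AfterFunc(next, func() {
        i.mu.Lock()
        defer i.mu.Unlock()
        // The "inner" tag lets a test trap this specific Until() call.
        next := i.clock.Until(i.activity.Add(10*time.Minute), "inner")
        if next == 0 {
            i.timeoutLocked()
            return
        }
        t.Reset(next)
    })
}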

All Quartz Clock functions, as well as the methods on returned timers and tickers, accept zero or more string tags that traps can match on.

func TestInactivityTimer_Late(t *testing.T) {
    // set a timeout on the test itself, so that if Wait functions get blocked, we don't have to
    // wait for the default test timeout of 10 minutes.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    mClock := quartz.NewMock(t)
    trap := mClock.Trap().Until("inner")
    defer trap.Close()

    it := &InactivityTimer{
        activity: mClock.Now(),
        clock:    mClock,
    }
    it.Start()

    // Trigger the AfterFunc
    w := mClock.Advance(10*time.Minute)
    c := trap.Wait(ctx)
    // Advance the clock a few ms to simulate a busy system
    mClock.Advance(3*time.Millisecond)
    c.Release() // Until() returns
    w.MustWait(ctx) // Wait for the AfterFunc to wrap up

    // Assert that the timeoutLocked() function was called
}

This test case will fail with our buggy implementation: when the triggered AfterFunc runs, Until() returns a negative duration (-3ms), so instead of calling timeoutLocked(), it resets the timer with that negative duration. The fix is easy: use next <= 0 as the comparison.

Correspondence Principle

One final criterion in designing Quartz’s API was the desire to be able to write unit tests that are easy to understand.

Time flows in one direction, from past to future, and I wanted unit tests written with Quartz to have the same monotonic flow and definite ordering. As you read a unit test from top to bottom, time advances forward, and you can define explicit breaks in the flow to inspect and assert the state of the code under test, like breakpoints in a debugger, or hitting the pause button while playing back a video.

Consider this example unit test from the Quartz repository. The color coding shows how time advances as the unit test executes.

From the perspective of the unit test function, the clock is always in a well defined state, and only changes state explicitly and synchronously.

The second ingredient for writing repeatable, deterministic unit tests is for the unit under test to be in a well-defined state, with no races, when you make your test assertions. Quartz cannot guarantee this on its own, but it is designed to make this as easy as possible via:

Waiting for Advance/AdvanceNext calls to complete. When paired with AfterFunc and TickerFunc, this allows you to wait for the unit under test to complete its timeout or tick processing. You can then assert the results of that processing without any races.
Trapping calls into the Quartz library via the Clock interface, Timers and/or Tickers. These allow you to suspend execution of the unit under test at deterministic points, where you can make assertions or advance the clock.

These two properties (the clock in a well-defined state, and the unit under test in a well-defined state) mean that you can make test assertions without worrying that they will flake. Polling (e.g. “Eventually”), sleeping, or calling runtime.Gosched() within your tests are all indications that your tests are not fully deterministic. It’s still possible to write tests this way that don’t flake, but it’s harder, and you trade away speed.

Present and Future of Quartz

The current release is v0.1.0, reflecting the fact that we are not yet making an ironclad promise not to change the API before v1.0. I don’t anticipate any changes, but it seems premature to make that promise before Quartz sees much deployment outside of Coder.

Having used Quartz internally at Coder for some time, we feel it is stable enough for general, public use. Just don’t robotically upgrade the library without checking until we declare the v1.0 API, likely later this year.

Please give Quartz a try. If you like it, star it! If you have any issues or feature requests, please write a GitHub issue on the repo.

¹ You may retort that testing at different times of day and seasons of the year gives you more test coverage. For what it’s worth, both of these flakes were test bugs related to computing the expected outputs, not product bugs. To the extent your code is sensitive to these factors, you want test cases for them, not the luck of the draw of when the test is run. That is, you don’t want to wait around until daylight saving time starts to know whether your code will work when it does.