
Flaky tests are nothing new - every engineer has experienced them. When deadlines are tight, it's much easier to click rerun than to dive into root cause analysis, and the bigger the product, the more complex the investigation becomes. At Coder we run thousands of tests across many jobs - most of them involve networking and concurrency, which makes them inherently prone to flakiness. Some flakes stem from OS differences, some from networking intricacies, and others are simply mistakes somebody made.
Our first step, a long time ago, was to report flaky tests to our metrics backend and create Slack alerts in our #dev channel. Whenever a merge to main caused a failing build, the responsible engineer would identify the failing test, create or update an issue, and use git blame to determine an assignee.
In the first half of this year we decided to declare war on flakes.
Investigating a failed CI run after a merge to main is a structured, rule-based process focused on compiling information from multiple sources. At a high level, it involved reading logs, checking existing issues for similar occurrences, filling out issues with the new run details, and checking git blame. We weren't always doing our best at it - sometimes people forgot, sometimes the check wasn't as methodical as it could have been.
Solving this process algorithmically is near impossible - the number of edge cases to cover is almost infinite. At that point we realized this could be a perfect job for an AI agent. However, such a bot requires a lot of wiring - it needs GitHub credentials to read job logs and issues, an environment to clone the repo and run git blame, and a way to begin its investigation into the flake automatically. Luckily for us, our Research department had recently delivered a little something called Blink.
Blink is Coder’s self-hosted framework for building and deploying AI agents. It connects to your organizational context and communicates through Slack. When you first run Blink, it walks you through a wizard-like experience of setting up the Slack app and connecting your LLM provider. Once configured, Blink opens a chat interface where you can customize and develop the agent's reasoning through AI-assisted coding. Blink handled the infrastructure wiring that normally takes days, giving us a major head start and the chance to focus on deploying the flake investigator to production.
When we first started building the flake bot, we made the classic mistake: we wrote a prompt that sounded like a job description.
```
You are a CI flake investigator. When a test fails, analyze the logs,
determine the root cause, and create a GitHub issue with the right owner.
```
This produced exactly what you'd expect: confident, plausible-sounding investigations that were wrong in subtle ways. The bot would assign issues to whoever triggered the CI run (almost never the right person). It also reported quite a few false positives.
Our engineers had already built a great mental model through dozens of flake investigations. They knew to grep for `signal: killed` when tests time out mysteriously. They knew that matrix job cancellations after a rerun are noise, not signal.
This knowledge lived in Slack threads, in code review comments, in the heads of people who'd been burned before.
So we wrote it down. All of it.
1. Failure classification logic
We explicitly enumerate the failure modes the bot should recognize:
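The exact categories are internal to our runbook; a sketch of the shape of that enumeration, with illustrative category names drawn from the failure modes discussed in this post:

```
1. Genuine flake      - timing, concurrency, or ordering dependence in the test
2. Data race          - Go race detector output present in the log
3. Infrastructure     - runner OOM, disk, or network problems, not the test
4. CI noise           - cancelled matrix jobs after a rerun, not real failures
5. Product regression - a real bug introduced by a recent change
```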
This went way beyond documentation. We built an entire decision tree. When the bot sees a failure, it now asks "which of these five categories does this fit?" instead of freestyling an explanation.
2. Detection patterns as literal code
We don't tell the bot to "look for signs of a panic." We give it the exact grep commands:
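For illustration, a minimal sketch of what those patterns look like in practice. The sample log and file name are ours, not from the actual runbook; the literal strings `WARNING: DATA RACE` and `signal: killed` are the real markers emitted by Go tooling and the OS.

```shell
# Fake CI log standing in for a real downloaded job log.
cat > ci-job.log <<'EOF'
=== RUN   TestTailnetConn
WARNING: DATA RACE
Read at 0x00c000123456 by goroutine 7:
FAIL    coder/tailnet   30.02s
EOF

# Go race detector output -> classify as a data-race flake.
grep -n 'WARNING: DATA RACE' ci-job.log

# Test binary killed by the OS (often OOM) -> infrastructure, not the test.
grep -n 'signal: killed' ci-job.log || echo "no OOM kill found"

# Unrecovered panic -> likely a genuine bug rather than a flake.
grep -n 'panic:' ci-job.log || echo "no panic found"
```

Giving the agent literal commands like these, rather than a vague instruction to "look for signs of a panic", is what makes its classifications reproducible.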
This matters because LLMs are pattern-matchers, and we're giving them the patterns. When we added the data race detection section—complete with the exact WARNING: DATA RACE string that Go's race detector emits—our accuracy on race-related flakes jumped immediately.
3. Explicit anti-patterns
Some of our most valuable prompt lines are prohibitions. This section exists because the bot kept making this exact mistake:
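A plausible reconstruction of that prohibition, based on the misassignment failure described earlier (the wording here is ours, not the actual prompt):

```
NEVER assign the issue to the person who triggered the CI run.
The person who clicked rerun is almost never the test owner.
Use git blame on the failing test to propose an owner instead.
```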
We also learned to call out specific false-positive scenarios:
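One such scenario, reconstructed from the patterns mentioned above (again, our wording):

```
Matrix job cancellations that occur after a rerun are noise, not signal.
Do NOT open or update a flake issue for a cancelled sibling job.
```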
4. Quality checklists
At the end of the prompt, we include an explicit checklist:
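The actual checklist is internal; an illustrative sketch based on the investigation steps described earlier in this post:

```
Before filing or updating an issue, confirm that you have:
[ ] Read the full job log, not just the failure summary
[ ] Classified the failure against the decision tree
[ ] Searched existing issues for the same test and failure signature
[ ] Run git blame on the failing test to propose an owner
[ ] Linked the failing run in the issue body
```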
Checklists make the model way less likely to skip steps. The format implies "all of these, every time."
5. Output templates
We specify exact title formats because consistency matters for searchability:
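The real convention is internal; a hypothetical example of what such a template might look like:

```
Title:   flake: <package>/<TestName> - <short failure signature>
Example: flake: tailnet/TestConn - WARNING: DATA RACE
```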
When every flake issue follows the same naming convention, duplicate detection actually works.
We introduced this on October 1st, 2025. Since then, the bot has analyzed and identified 113 issues, of which only 7 ended up being reassigned to different owners. 94% correctness ain’t that bad, considering the bot itself responded to 95% of incoming requests. The remaining 5% were mostly Blink bugs we’ve been fixing along the way (exceeding the context window, Slack webhook behavior, etc.).
Building this runbook taught us what good prompting actually looks like: take implicit knowledge from experts, make it explicit, and refine it through iteration. That's exactly what we did when moving from "investigate flakes" to a 200-line decision tree with grep commands and quality checklists. The same discipline applies after deployment.
We version control our system prompt and review changes like code. When the bot makes a mistake, we don't just correct the output—we ask "what instruction would have prevented this?" and add it to the prompt.
Some recent additions:
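To give a flavor of the kind of line such a postmortem produces (these examples are illustrative, not the actual additions):

```
- If the job was cancelled because a sibling matrix job failed, stop:
  do not file an issue for the cancellation itself.
- Quote the exact failing assertion in the issue body, not a paraphrase.
```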
The prompt is a living document of everything that has ever gone wrong, much like an older codebase that has accumulated handling for every edge case.
There's nothing magical here. Our "AI-powered flake investigation system" is, at its core, a very long checklist being executed by a language model with tool access.
But that's precisely what makes it work. The LLM provides the flexibility to handle natural language logs, the judgment to weigh ambiguous evidence, and the tirelessness to run the same investigation at 3 AM that it runs at 10 AM. The prompt provides the structure to make that judgment reliable.
The best LLM prompts aren't clever. They're comprehensive. They're the runbook you always meant to write but never had time for—except now, something actually follows it.
Want to stay up to date on all things Coder? Subscribe to our monthly newsletter and be the first to know when we release new things!