
Flaky tests are nothing new - every engineer has experienced them. When deadlines are tight, it's much easier to click rerun than to dive into root cause analysis, and the bigger the product, the more complex the investigation becomes. At Coder we run thousands of tests across many jobs - most of them involve networking and concurrency, which makes them inherently prone to flakiness. Some flakes stem from OS differences, some from networking intricacies, and others are simply mistakes somebody made.
Our first step, a long time ago, was to report flaky tests to our metrics backend and create Slack alerts in our #dev channel. Whenever a merge to main caused a failing build, the responsible engineer would identify the failing test, create or update an issue, and use git blame to determine an assignee.
In the first half of this year we decided to declare war on flakes.
Investigating a failed CI run after a merge to main is a structured, rule-based process focused on compiling information from multiple sources. At a high level, it involved reading logs, checking existing issues for similar occurrences, filling out issues with the new run details, and checking git blame. We weren't always doing our best at it - sometimes people forgot, sometimes the check wasn't as methodical as it could have been.
Solving this process algorithmically is near impossible - the number of edge cases to cover is almost infinite. At that point we realized this could be a perfect job for an AI agent. However, such a bot requires a lot of wiring - it needs GitHub credentials to read job logs and issues, an environment to clone the repo and run git blame, and a way to begin its investigation into the flake automatically. Luckily for us, our Research department had recently delivered a little something called Blink.
Blink is Coder’s self-hosted framework for building and deploying AI agents. It connects to your organizational context and communicates through Slack. When you first run Blink, it walks you through a wizard-like experience of setting up the Slack app and connecting your LLM provider. Once configured, Blink opens a chat interface where you can customize and develop the agent's reasoning through AI-assisted coding. Blink handled the infrastructure wiring that normally takes days, giving us a major head start and the chance to focus on deploying the flake investigator to production.
When we first started building the flake bot, we made the classic mistake: we wrote a prompt that sounded like a job description.
```
You are a CI flake investigator. When a test fails, analyze the logs,
determine the root cause, and create a GitHub issue with the right owner.
```
This produced exactly what you'd expect: confident, plausible-sounding investigations that were wrong in subtle ways. The bot would assign issues to whoever triggered the CI run (almost never the right person). It also reported quite a few false positives.
Our engineers had already built a great mental model through dozens of flake investigations. They knew to grep for `signal: killed` when tests time out mysteriously. They knew that matrix job cancellations after a rerun are noise, not signal.
This knowledge lived in Slack threads, in code review comments, in the heads of people who'd been burned before.
So we wrote it down. All of it.
1. Failure classification logic
We explicitly enumerate the failure modes the bot should recognize:
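The exact categories are internal to our runbook; a sketch of the shape of that enumeration, with illustrative category names drawn from the failure modes discussed in this post:

```
1. Genuine flake      - timing, concurrency, or ordering dependence in the test
2. Data race          - Go race detector output present in the log
3. Infrastructure     - runner OOM, disk, or network problems, not the test
4. CI noise           - cancelled matrix jobs after a rerun, not real failures
5. Product regression - a real bug introduced by a recent change
```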
This went way beyond documentation. We built an entire decision tree. When the bot sees a failure, it now asks "which of these five categories does this fit?" instead of freestyling an explanation.
2. Detection patterns as literal code
We don't tell the bot to "look for signs of a panic." We give it the exact grep commands:
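For illustration, a minimal sketch of what those patterns look like in practice. The sample log and file name are ours, not from the actual runbook; the literal strings `WARNING: DATA RACE` and `signal: killed` are the real markers emitted by Go tooling and the OS.

```shell
# Fake CI log standing in for a real downloaded job log.
cat > ci-job.log <<'EOF'
=== RUN   TestTailnetConn
WARNING: DATA RACE
Read at 0x00c000123456 by goroutine 7:
FAIL    coder/tailnet   30.02s
EOF

# Go race detector output -> classify as a data-race flake.
grep -n 'WARNING: DATA RACE' ci-job.log

# Test binary killed by the OS (often OOM) -> infrastructure, not the test.
grep -n 'signal: killed' ci-job.log || echo "no OOM kill found"

# Unrecovered panic -> likely a genuine bug rather than a flake.
grep -n 'panic:' ci-job.log || echo "no panic found"
```

Giving the agent literal commands like these, rather than a vague instruction to "look for signs of a panic", is what makes its classifications reproducible.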
This matters because LLMs are pattern-matchers, and we're giving them the patterns. When we added the data race detection section—complete with the exact WARNING: DATA RACE string that Go's race detector emits—our accuracy on race-related flakes jumped immediately.
3. Explicit anti-patterns
Some of our most valuable prompt lines are prohibitions. This section exists because the bot kept making this exact mistake:
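A plausible reconstruction of that prohibition, based on the misassignment failure described earlier (the wording here is ours, not the actual prompt):

```
NEVER assign the issue to the person who triggered the CI run.
The person who clicked rerun is almost never the test owner.
Use git blame on the failing test to propose an owner instead.
```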
We also learned to call out specific false-positive scenarios:
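One such scenario, reconstructed from the patterns mentioned above (again, our wording):

```
Matrix job cancellations that occur after a rerun are noise, not signal.
Do NOT open or update a flake issue for a cancelled sibling job.
```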
4. Quality checklists
At the end of the prompt, we include an explicit checklist:
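The actual checklist is internal; an illustrative sketch based on the investigation steps described earlier in this post:

```
Before filing or updating an issue, confirm that you have:
[ ] Read the full job log, not just the failure summary
[ ] Classified the failure against the decision tree
[ ] Searched existing issues for the same test and failure signature
[ ] Run git blame on the failing test to propose an owner
[ ] Linked the failing run in the issue body
```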
Checklists make the model way less likely to skip steps. The format implies "all of these, every time."
5. Output templates
We specify exact title formats because consistency matters for searchability:
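The real convention is internal; a hypothetical example of what such a template might look like:

```
Title:   flake: <package>/<TestName> - <short failure signature>
Example: flake: tailnet/TestConn - WARNING: DATA RACE
```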
When every flake issue follows the same naming convention, duplicate detection actually works.
We introduced this on October 1st, 2025. Since then, the bot has analyzed and identified 113 issues, of which only 7 ended up being reassigned to different owners. 94% correctness ain’t that bad, considering the bot itself responded to 95% of incoming requests. The remaining 5% were mostly Blink bugs we’ve been fixing along the way (exceeding the context window, Slack webhook behavior, etc.).
Building this runbook taught us what good prompting actually looks like: take implicit knowledge from experts, make it explicit, and refine it through iteration. That's exactly what we did when moving from "investigate flakes" to a 200-line decision tree with grep commands and quality checklists. The same discipline applies after deployment.
We version control our system prompt and review changes like code. When the bot makes a mistake, we don't just correct the output—we ask "what instruction would have prevented this?" and add it to the prompt.
Some recent additions:
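To give a flavor of the kind of line such a postmortem produces (these examples are illustrative, not the actual additions):

```
- If the job was cancelled because a sibling matrix job failed, stop:
  do not file an issue for the cancellation itself.
- Quote the exact failing assertion in the issue body, not a paraphrase.
```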
The prompt is a living document of everything that has ever gone wrong, much like an older codebase that has accumulated handling for every edge case.
There's nothing magical here. Our "AI-powered flake investigation system" is, at its core, a very long checklist being executed by a language model with tool access.
But that's precisely what makes it work. The LLM provides the flexibility to handle natural language logs, the judgment to weigh ambiguous evidence, and the tirelessness to run the same investigation at 3 AM that it runs at 10 AM. The prompt provides the structure to make that judgment reliable.
The best LLM prompts aren't clever. They're comprehensive. They're the runbook you always meant to write but never had time for—except now, something actually follows it.
Want to stay up to date on all things Coder? Subscribe to our monthly newsletter and be the first to know when we release new things!