As soon as Claude Code was released, we were chomping at the bit to see what it could actually do. Not on toy problems, but on our real development work. So we started feeding it real GitHub issues from our open-source and internal repositories — issues created by both community members and Coder engineers — to see how well it could tackle actual engineering problems.
This post covers three of those experiments. For each one, we break down what happened, how close Claude Code got to a working solution, where it struggled, and what we learned. These aren’t cherry-picked success stories. We wanted to see the full range, from wins to failures to everything in between.
Claude Code handled some tasks impressively and stumbled on others, but it’s effective enough that some of our developers use it every day. What makes that more encouraging is that we only gave it GitHub issue links and a few small nudges. It had to explore the repo and figure out the rest on its own.
Documented by: Hugo, Growth Engineer at Coder. I contribute to our core product and internal tooling, doing both backend and frontend work.
Task: We have an internal admin dashboard built in Next.js. It's about 20,000 lines of code. Among other things, it displays GitHub issues created by our customers. We wanted to make the issues sortable by when they were created and updated.
Outcome (TL;DR): After prompting Claude Code with the issue title "Allow sorting /customer-issues by creation/update time," it simply carried out the task. Without additional instructions, it identified the relevant files, added a toggle to the UI for selecting the sort preference, and implemented the sorting logic. I reviewed the code and asked in a single sentence to also allow choosing the sort direction. It implemented that too.
I noticed it had mistakenly flipped the meanings of "ascending" and "descending," so I edited a single line of code to fix it. While testing the UI, I discovered a bug in the existing human-written implementation: it mutated an object containing the issue list on every React render, causing some text to duplicate after re-rendering. I described the bug as I saw it in the browser and mentioned that I suspected a mutated object was the cause. Claude Code found the faulty code and fixed it. Finally, I asked it to submit a pull request, and it did.
The result was 114 lines added and 28 deleted in a single file.
Learnings:
Cost: Less than $5.
Time: 5 minutes of AI work, 40 minutes of me reviewing the code and verifying the changes were correct.
What we’d do differently: Not much — Claude Code did great.
Documented by: Thomas Kosiewski, Senior Software Engineer. I work on Coder's networking team, covering both the backend and the native macOS side.
Task: https://github.com/coder/coder-desktop-macos/pull/105. While dealing with Apple's entitlements and signing process, we kept running into issues where macOS apparently loses the stapled certificates, or some other unknown process interferes in the background on the machine. There is an open thread on the Apple Developer Forums about this happening to other folks years ago, but Apple moves in mysterious, yet important, ways, to quote Severance. Throwing AI at this issue was more of an attempt to surface something hidden deep in Apple's docs or in obscure, dated APIs that are no longer easily accessible.
Outcome (TL;DR): After multiple rounds of back and forth, the AI produced code that compiles and works on macOS, which is somewhat remarkable given the many language changes Swift has seen over the years. Because the work was mostly systems-related and required a fair amount of human interaction with macOS (deactivating and removing a network extension), it was roughly a 30/70 split between Claude Code working and me verifying and prompting it.
Learnings:
Cost: ~$10
Time: ~30 minutes of API time, ~90 minutes total
What we’d do differently: How to prompt Claude Code to write "production-ready" code, and how to give it enough context and memory to produce high-quality code, is in my opinion still to be researched and understood. Being able to quickly add a feature or showcase how something might or should behave, without prior knowledge or experience, is amazing; getting it over the finish line is an entirely different challenge.
Documented by: Cian, Staff Engineer at Coder. Works primarily on backend tasks, familiar with container orchestration tools, and author of this particular part of the codebase. Has used AI tools casually in the past.
Task: Modify a JSON REST API endpoint that invokes a predefined command to return a different part of the command output: https://github.com/coder/coder/pull/16866
Initial prompt: *"The agent/agentcontainers package defines a DockerCLILister. Part of the functionality of this struct is to enable listing running containers. There is a bug in the existing implementation where the returned port in the WorkspaceAgentListContainersResponse is incorrect. It should return the port of the container accessible from outside the container, but currently returns the port inside the container."*
Outcome (TL;DR): The agent partially solved the task, but it took some re-prompting to reach even this partial solution, including explicit instructions to fetch data from NetworkSettings.Ports and to de-duplicate host ports mapped to the same container port (note that there is some additional fine detail here that wasn't originally apparent to me either).
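For reference, the kind of logic the prompt was asking for can be sketched directly against the NetworkSettings.Ports structure that docker inspect reports. The types and function names below are illustrative only, not the actual agentcontainers implementation; they just show the host-port extraction and de-duplication described above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// portBinding mirrors one entry in the NetworkSettings.Ports map of
// `docker inspect` output: {"HostIp": "...", "HostPort": "..."}.
type portBinding struct {
	HostIP   string `json:"HostIp"`
	HostPort string `json:"HostPort"`
}

// hostPorts returns, for each "containerPort/proto" key, the host ports the
// container is reachable on from outside, de-duplicating bindings that repeat
// the same host port for different host IPs (e.g. 0.0.0.0 and ::).
func hostPorts(ports map[string][]portBinding) map[string][]uint16 {
	out := make(map[string][]uint16)
	for containerPort, bindings := range ports {
		seen := make(map[uint16]struct{})
		for _, b := range bindings {
			p, err := strconv.ParseUint(b.HostPort, 10, 16)
			if err != nil {
				continue // port not published to the host, or malformed
			}
			hp := uint16(p)
			if _, dup := seen[hp]; dup {
				continue
			}
			seen[hp] = struct{}{}
			out[containerPort] = append(out[containerPort], hp)
		}
	}
	return out
}

func main() {
	// Trimmed `docker inspect` output: container port 8080/tcp published on
	// host port 32768 for both the IPv4 and IPv6 wildcard addresses.
	raw := `{"8080/tcp": [
		{"HostIp": "0.0.0.0", "HostPort": "32768"},
		{"HostIp": "::", "HostPort": "32768"}
	]}`
	var ports map[string][]portBinding
	if err := json.Unmarshal([]byte(raw), &ports); err != nil {
		panic(err)
	}
	fmt.Println(hostPorts(ports)) // map[8080/tcp:[32768]]
}
```

The de-duplication matters because Docker commonly publishes the same host port once for the IPv4 wildcard address and once for the IPv6 wildcard, which is a likely source of the duplicate host ports mentioned above.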
With the default settings, it also needed to be explicitly prompted to add test coverage – this could likely be improved by adjusting the prompt settings – and had to be prompted to re-run all tests in the package after making changes.
I also noted the agent's propensity to write overly complex code when comparing expected versus actual test outputs (see the left-hand side of the diff in commit 077d5a).
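As an aside, much of that hand-rolled comparison code can usually be collapsed into a single deep-equality assertion. A minimal sketch using the testify library, with an illustrative container type and a stubbed-out call to the code under test rather than the real types from the PR:

```go
package example_test

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// container is an illustrative stand-in for the response type under test.
type container struct {
	ID    string
	Ports map[string][]uint16
}

// listContainersUnderTest is a hypothetical stub standing in for a call to
// the real code under test (e.g. via the DockerCLILister).
func listContainersUnderTest(t *testing.T) []container {
	t.Helper()
	return []container{
		{ID: "abc123", Ports: map[string][]uint16{"8080/tcp": {32768}}},
	}
}

func TestListContainers(t *testing.T) {
	expected := []container{
		{ID: "abc123", Ports: map[string][]uint16{"8080/tcp": {32768}}},
	}

	// A single deep-equality assertion replaces nested loops that compare
	// IDs and ports field by field and format their own failure messages.
	require.Equal(t, expected, listContainersUnderTest(t))
}
```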
Ultimately, I decided that there was insufficient underlying test data to ensure a correct solution, and I elected to raise a separate PR with appropriate test fixtures. It is possible that having those fixtures in place would have helped the agent find a correct solution.
Learnings:
Cost: $2.65
Time: ~1h wall time (including initial setup), 7m 34s API time
What we’d do differently:
These experiments gave us a clearer sense of what AI agents can and can’t do today. Claude Code did well on small, well-defined tasks in familiar frameworks like React and Next.js. It struggled when deeper reasoning, system complexity, or poorly documented APIs were involved. What made this possible was running AI in isolated Coder workspaces, giving it a safe place to experiment without putting real code at risk. And while AI still needs human supervision, some of us are already using it daily to get through parts of our work faster.
We also came away with a better sense of costs, which is critical if you’re thinking about using AI at scale. Knowing what kinds of tasks AI can handle, how much it costs, and how to control that work safely is where the value starts to show. AI isn’t magic, but when paired with the right environment, it can be useful and is improving quickly. If you’re exploring this space too, we’d love to hear what you’re learning.
Next, we’re planning a follow-up where we explore how far Claude Code can go with better prompts, more context, and internal tooling to make our codebase more AI-friendly. Stay tuned for part two.