As soon as Claude Code was released, we were chomping at the bit to see what it could actually do. Not on toy problems, but on our real development work. So we started feeding it real GitHub issues from our open-source and internal repositories — issues created by both community members and Coder engineers — to see how well it could tackle actual engineering problems.
This post covers three of those experiments. For each one, we break down what happened, how close Claude Code got to a working solution, where it struggled, and what we learned. These aren’t cherry-picked success stories. We wanted to see the full range, from wins to failures to everything in between.
Claude Code handled some tasks impressively and stumbled on others, but it’s effective enough that some of our developers use it every day. What makes that more encouraging is that we only gave it GitHub issue links and a few small nudges. It had to explore the repo and figure out the rest on its own.
Task #1: New feature in an internal admin dashboard
Documented by: Hugo, Growth Engineer at Coder. I contribute to our core product and internal tooling, doing both backend and frontend work.
Task: We have an internal admin dashboard built in Next.js. It's about 20,000 lines of code. Among other things, it displays GitHub issues created by our customers. We wanted to make the issues sortable by when they were created and updated.
Outcome (TL;DR): After prompting Claude Code with the issue title "Allow sorting /customer-issues by creation/update time," it simply carried out the task. Without additional instructions, it identified the relevant files, added a toggle to the UI for selecting the sort preference, and implemented the sorting logic. I reviewed the code and asked in a single sentence to also allow choosing the sort direction. It implemented that too.
I noticed it had mistakenly flipped the meanings of "ascending" and "descending," so I edited a single line of code to fix it. While testing the UI, I discovered a bug in the existing human-written implementation: it mutated an object containing the issue list on every React render, causing some text to duplicate after re-rendering. I described the bug as I saw it in the browser and mentioned that I suspected a mutated object was the cause. Claude Code found the faulty code and fixed it. Finally, I asked it to submit a pull request, and it did.
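For illustration, here is a minimal sketch of the non-mutating pattern that avoids that kind of re-render bug: copy the array before sorting instead of sorting the data React renders from. This is TypeScript, not the dashboard's actual code, and the `CustomerIssue` shape, field names, and `sortIssues` helper are assumptions made up for the example.

```typescript
// Hypothetical shape of a customer issue; the real dashboard's types differ.
interface CustomerIssue {
  title: string;
  createdAt: string; // ISO timestamp
  updatedAt: string; // ISO timestamp
}

type SortField = "createdAt" | "updatedAt";
type SortDirection = "asc" | "desc";

function sortIssues(
  issues: readonly CustomerIssue[],
  field: SortField,
  direction: SortDirection,
): CustomerIssue[] {
  // Copy with [...issues] instead of sorting in place: mutating the array
  // (or the objects inside it) during render is the kind of bug described
  // above, because React may re-run render at any time.
  const sorted = [...issues].sort(
    (a, b) => Date.parse(a[field]) - Date.parse(b[field]),
  );
  return direction === "desc" ? sorted.reverse() : sorted;
}
```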
The result was 114 lines added and 28 deleted in a single file.
Learnings:
- Coming into this task, I didn't expect Claude Code to be useful. My past experience with AI agents, like those in Cursor, was that their suggestions were unhelpful. But Claude Code made the correct changes right away with little direction.
- Using Claude Code was faster than writing the code myself. I wasn't the original author of the code that needed to be changed, and Claude Code figured out what to do sooner than I could.
- I believe its effectiveness came from a combination of the task's simplicity, the small codebase, and the use of off-the-shelf frameworks like Next.js and React.
- Claude Code's success with this task convinced me to use it more. I've found it fantastic for clear tasks in small codebases but less effective when the task is unclear or the codebase is large and complex.
Cost: Less than $5.
Time: 5 minutes of AI work, 40 minutes of me reviewing the code and verifying the changes were correct.
What we’d do differently: Not much — Claude Code did great.
Task #2: Native macOS development
Documented by: Thomas Kosiewski, Senior Software Engineer. I work in the networking team of Coder in both the backend and the macOS native parts.
Task: https://github.com/coder/coder-desktop-macos/pull/105 While dealing with Apple’s entitlements and signing process, we kept running into issues where macOS appeared to lose the stapled certificates, or some other unknown process interfered in the background on the machine. There is an open thread on the Apple Developer Forums about this happening to other folks years ago, but Apple moves in mysterious, yet important, ways (to quote Severance). Throwing the AI at this issue was more of an attempt to surface something hidden in deep Apple docs or in obscure, dated APIs that are no longer easy to find.
Outcome (TL;DR): After multiple rounds of back-and-forth, the AI managed to produce code that compiles and works on macOS, which is somewhat remarkable given the many language changes Swift has seen over the years. Since this was mostly systems-related work and required a fair amount of human interaction with macOS (deactivating and removing a network extension), it was a 30/70 split between Claude Code working and me verifying and prompting it.
Learnings:
- I was surprised that Claude managed to find the APIs required to remove a network extension from macOS. Those are not well documented and hard even for a human to Google. Generating Swift code that compiles and works is something of a feat in itself. Being able to ask for buttons or error text to be moved from one section to another tab and back, without doing it by hand, is very refreshing and makes refactoring a bit less manual.
- The quality of the code Claude Code produced in this case was not good. It was functional and fine for a PoC, but turning it into production-ready code will take some time. It sometimes ignored existing delegates and reimplemented things that already existed. (It also claimed to have implemented something it never built.)
Cost: ~$10
Time: ~30 min API time, ~90 minutes total time
What we’d do differently: How to prompt Claude Code to write "production-ready" code, and how to give it enough context and memory to produce high-quality output, is still an open question in my opinion. Being able to quickly add a feature or prototype how something might behave, without prior knowledge or experience, is amazing; getting it over the finish line is an entirely different challenge.
Task #3: Backend development in Go
Documented by: Cian, Staff Engineer at Coder. Works primarily on backend tasks, familiar with container orchestration tools, and author of this particular part of the codebase. Has used AI tools casually in the past.
Task: Modify a JSON REST API endpoint that invokes a predefined command to return a different part of the command output: https://github.com/coder/coder/pull/16866
Initial prompt: "The agent/agentcontainers package defines a DockerCLILister. Part of the functionality of this struct is to enable listing running containers. There is a bug in the existing implementation where the returned port in the WorkspaceAgentListContainersResponse is incorrect. It should return the port of the container accessible from outside the container, but currently returns the port inside the container."
Outcome (TL;DR): The agent partially solved the task, but it required re-prompting, with explicit instructions to fetch data from NetworkSettings.Ports and to de-duplicate host ports mapped to the same container port, to arrive at even this partial solution (note that there is some additional fine detail here that wasn't originally apparent to me either).
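For context, `docker inspect` reports `NetworkSettings.Ports` as a map from a container port (e.g. `"8080/tcp"`) to a list of host bindings, and the same host port typically appears once for `0.0.0.0` and once for `::`, which is where the de-duplication comes in. The sketch below shows the shape of that mapping; it is written in TypeScript for brevity (the real implementation is Go), and the `hostPortsByContainerPort` helper name is made up for illustration.

```typescript
// Shape of NetworkSettings.Ports as reported by `docker inspect`, e.g.:
// { "8080/tcp": [ { "HostIp": "0.0.0.0", "HostPort": "3000" },
//                 { "HostIp": "::",      "HostPort": "3000" } ],
//   "443/tcp": null }
interface PortBinding {
  HostIp: string;
  HostPort: string;
}
type PortMap = Record<string, PortBinding[] | null>;

// Hypothetical helper: for each exposed container port, collect the host
// ports it is published on, de-duplicating bindings that differ only by
// host IP (the IPv4 and IPv6 bindings usually share a host port).
function hostPortsByContainerPort(ports: PortMap): Record<string, number[]> {
  const result: Record<string, number[]> = {};
  for (const [containerPort, bindings] of Object.entries(ports)) {
    const unique = new Set<number>();
    for (const binding of bindings ?? []) {
      const hostPort = Number(binding.HostPort);
      if (!Number.isNaN(hostPort)) {
        unique.add(hostPort);
      }
    }
    result[containerPort] = Array.from(unique);
  }
  return result;
}
```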
With the default settings, it also needed to be explicitly prompted to add test coverage (this could likely be improved by adjusting the prompt settings) and had to be prompted to re-run all tests in the package after making changes.
I also noted a propensity of the agent to write overly complex code (see LHS of commit 077d5a) related to comparing expected versus actual test outputs.
Ultimately, I decided that there was insufficient underlying test data to ensure a correct solution, and elected to raise a separate PR with appropriate test fixtures. It is possible that this would have helped the agent find a correct solution.
Learnings:
- The agent did less well than expected; it seemed to just attempt the bare minimum of a solution without much attention to correctness.
- Tasks that require knowledge of external APIs and/or dependencies on external tooling seem to be difficult for this system to reason about. Providing example data up front and explicitly instructing the agent to address the provided test cases may help here, but also runs a risk of 'over-fitting' to the test data.
- The agent also seems to prefer smaller changes – for example, if changing a single line in a test case exposes a bug in the code being tested, it was observed to prefer changing the test code as opposed to the code under test. It had to be explicitly prompted not to modify the test code.
Cost: $2.65
Time: ~1h wall time (included initial setup), 7m 34s API time
What we’d do differently:
- Provide specific test cases and examples up-front, if possible – or instruct the agent how to generate valid test data.
- Instruct the agent to run package-level tests after changes in each package to establish a positive feedback loop.
- Enforce Test-Driven-Development (TDD) by providing specific instructions about which sets of files the agent may or may not modify. If there is a known bug, first instruct it to add a test case to reproduce the bug while disallowing edits of the code under test. Then instruct it to fix the code under test while disallowing edits of the test code.
- Remind the agent to perform a final refactoring step at the end (the oft-overlooked part of TDD!).
Conclusion
These experiments gave us a clearer sense of what AI agents can and can’t do today. Claude Code did well on small, well-defined tasks in familiar frameworks like React and Next.js. It struggled when deeper reasoning, system complexity, or poorly documented APIs were involved. What made this possible was running AI in isolated Coder workspaces, giving it a safe place to experiment without putting real code at risk. And while AI still needs human supervision, some of us are already using it daily to get through parts of our work faster.
We also came away with a better sense of costs, which is critical if you’re thinking about using AI at scale. Knowing what kinds of tasks AI can handle, how much it costs, and how to control that work safely is where the value starts to show. AI isn’t magic, but when paired with the right environment, it can be useful and is improving quickly. If you’re exploring this space too, we’d love to hear what you’re learning.
Next, we’re planning a follow-up where we explore how far Claude Code can go with better prompts, more context, and internal tooling to make our codebase more AI-friendly. Stay tuned for part two.