Mar 11, 2024 · 11 min read

Coder's Well-Architected Framework

Eric Paulsen

Introduction

This framework provides design principles and architectural best practices for building and operating a highly reliable, efficient, and scalable Coder deployment. Built upon the six pillars of the AWS Well-Architected Framework, Coder’s Well-Architected Framework is intended for prospects and customers looking to provide a sustainable cloud development environment solution to their internal developer groups. This document is written for Coder’s operators, whose mission is to support their end-users, who may be developers, data scientists, or business analysts.

Terminology

In the Coder Well-Architected Framework, we use these terms:

A CDE (Cloud Development Environment) is an on-demand, ephemeral development environment that is hosted in the public cloud or on-premises. The developer connects to the CDE to access their development environment, either via an in-browser IDE, or through a desktop IDE.
An IDE (Integrated Development Environment) is a text editor developers use to write their code. The Coder workspace exposes the IDE in-browser. Alternatively, a desktop IDE can be used to connect to the Coder workspace.
A KPI (Key Performance Indicator) is a quantifiable metric used to measure the success of an organization. In this case, it is the success of your engineering organization, from a developer productivity and experience perspective.
A workspace is the development environment that Coder users work inside. Workspaces are provisioned using Terraform, inside the customer's infrastructure. Each workspace has one associated user.
Templates are a Coder construct consisting of Terraform files, run by the Coder provisioner to create workspaces on behalf of users. A template may be one or multiple files that represent the underlying infrastructure that hosts the workspace.
The Coder server is the control plane of the Coder application. It proxies workspace connections, runs workspace provisioning jobs, makes database calls, and serves the web interface.
A provisioner is a Terraform runner in the Coder context. It is responsible for start/stop/delete operations on workspaces. One or more provisioners can run at a time, and they can be run separately for network isolation purposes. Templates can be tagged so that they are run and built by specific provisioners.
A workspace proxy is a server responsible for proxying browser and CLI-based connections to Coder workspace IDEs and terminals. Proxies can be run across regions to serve globally distributed users.

Personas

In a Coder deployment, there are two personas:

The end-user is someone who uses Coder to create and access development environments. They could be developers, analysts, or data scientists. The end-user consumes a license seat and has login access to Coder. They are typically a member of a user group that works on a project or specific set of projects.
The operator is responsible for running and maintaining the Coder deployment at the application level, the infrastructure level, or both. Operators may be part of a platform engineering or developer experience group. Depending upon the size of your organization, there may be 1-4 operators who support end-users and roll out new versions and features of Coder. Operators may consume a license seat, and typically have elevated privileges in Coder, such as the User Admin, Template Admin, or Owner roles.

Reference Architecture

Single-region Coder Enterprise deployment

Multi-region Coder Enterprise deployment

(1.) Operational Excellence

a. Understand your developer community - When evaluating a CDE, it is important to understand the needs of your developer community, both from a tooling and experience perspective. Ask yourself the following questions:

How do developers set up their computing environment when starting on a new project?
How do developers authenticate to other services (version control, package repositories)?
What IDEs are prevalent within the company?
What languages, tools, and dependencies do our developers use?
Can we containerize the development environments?

b. Start small - When building templates, start with the simplest possible workspace, before layering on additional infrastructure and use-cases. As with any application, it's best to build the smallest working unit of the code before adding complexity.

c. Consider workspaces to be ephemeral - CDEs drive value because the environments they create are highly consistent and reproducible. If your environment no longer works, delete and recreate it. You no longer have to limit yourself to a single machine for your development environment. Create multiple workspaces, each unique to a specific project. Keep in mind that workspace disks can also be persisted.

d. Perform changes as code - Operate Coder as you would any highly available application, and use automated deployment methods (such as CI/CD and infrastructure-as-code tooling) to iterate on both the Coder deployment and your workspace templates.

e. Gain insights by implementing observability - Establish KPIs tied to your business outcomes, such as daily active users, or workspace utilization, to inform decision making and improve the return on your investment.
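As a rough illustration of the daily active users KPI mentioned above, the sketch below counts distinct users per calendar day. The `(username, timestamp)` record shape and the function name are assumptions made for this example; they are not a Coder export format or API.

```python
from collections import defaultdict
from datetime import date, datetime


def daily_active_users(events):
    """Count distinct users per calendar day.

    `events` is an iterable of (username, timestamp) pairs. This record
    shape is hypothetical and only serves the example.
    """
    users_by_day = defaultdict(set)
    for username, timestamp in events:
        users_by_day[timestamp.date()].add(username)
    # Sort by day so the result reads chronologically.
    return {day: len(users) for day, users in sorted(users_by_day.items())}


# Made-up sample data: alice connects twice on the first day.
events = [
    ("alice", datetime(2024, 3, 11, 9, 0)),
    ("alice", datetime(2024, 3, 11, 15, 30)),
    ("bob", datetime(2024, 3, 11, 10, 0)),
    ("alice", datetime(2024, 3, 12, 9, 5)),
]

for day, count in daily_active_users(events).items():
    print(day.isoformat(), count)
```

Using a set per day means repeat connections by the same user on one day are counted once, which is what distinguishes active users from raw session counts.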

(2.) Security

a. Evaluate your organization’s security needs - Pre-configure development tools to point to internal package repositories, and consider blocking access to public repositories at the network layer.

b. Protect data in-transit and at rest - Encrypt network traffic end-to-end using TLS certificates, and encrypt the PostgreSQL database to protect sensitive data such as user access tokens. Workspace disks can also be encrypted if necessary.

c. Implement role-based access control on templates - Templates enable your developers to create infrastructure in a self-service manner. Use role-based access control to limit which users and groups have access to specific templates. For example, you may need to separate certain groups, such as full-time employees and third-party contractors, from one another to prevent unapproved access to infrastructure and code.

d. Isolate provisioners to specific clouds/clusters - Run a provisioner deployment inside each cloud or cluster you plan to use for workspaces. Utilize the native service identity for that cloud or cluster to authorize provisioning. Enforce the smallest set of privileges that allow successful provisioning within each cloud or cluster. This will reduce the blast radius of a compromised provisioner.

e. Enforce workspace updates - Templates enable you to update your fleet of developers with one push, whether that update consists of a patched vulnerability, or a tooling change. Put policies in place to update workspaces immediately upon creating a new template version, using Coder’s template update policy feature.

f. Collect Coder audit logs and transmit them to your SIEM or SOC - Audit logs allow you to trace key events like users accessing Coder’s dashboard or workspace in the event of a compromised account or workspace. They are invaluable in incident response.

(3.) Reliability

a. Run Coder in high availability mode - Run multiple replicas of the Coder server, so in the event of one going down, the other replicas can continue serving dashboard traffic and workspace connections. This also allows upgrades to proceed without taking the Coder service offline.

b. Leverage Prometheus metrics to identify problems - Monitor the health of your Coder deployment and workspaces by querying and analyzing the platform’s Prometheus metrics. These metrics provide time-series data on endpoint failures and resource utilization that will inform you of possible weaknesses in your network or infrastructure.

c. Implement a failover strategy - Think about the SLA you have in place for other developer tools, such as version control, and determine how much downtime is acceptable. Take into account potential infrastructure failures, and replicate your Coder server and database across regions and zones.

d. Take periodic database snapshots - Take snapshots of the Coder database on a cadence, and before every upgrade. This ensures that you can restore Coder to a known, stable state in case any issues arise. Make sure to store your templates in version control, so that if the database is lost, your templates can be pushed back into Coder seamlessly.

e. Run workspaces on scalable compute - Developer workspaces can demand large amounts of resources, especially when building or compiling code. In addition, there can be times where large numbers of workspaces are being requested. Leverage auto-scaling compute to meet these resource demands efficiently.

f. Run the Coder server on static compute - Run the Coder server on static (non-autoscaling) compute so that scaling events do not affect end-users. If Coder is run on dynamic compute and a scale-down event occurs, dashboard and workspace connections may be dropped.

g. Enable multiple STUN servers - Coder’s networking uses STUN servers to learn about network address translation (NAT) between users and their workspaces to allow direct, encrypted tunnels. These direct tunnels generally have more stable latency and are resilient to Coder server or workspace proxy instances going down (either planned or unplanned). If users connect to workspaces over a corporate network or VPN, deploy two or more STUN servers inside the corporate network and configure Coder to use them.

h. Allow UDP traffic between users and workspaces - Coder’s direct, encrypted tunnels run over UDP on ephemeral ports. If firewalls block UDP, direct connections are not possible, and tunnels fall back to using the Coder server or workspace proxies as relays. Relayed connections are slower and consume server-side resources.

(4.) Performance Efficiency

a. Keep workspaces close to users - To minimize connection latency (and improve the developer experience), deploy workspaces in the regions close to where the end-users are located.

b. Run workspace proxies near your users - The Coder server can proxy workspace connections, but not all of your end-users will be closely located to the region where it is run. Take advantage of workspace proxies to localize the connection proxying to both developer and workspace.
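When deciding where to place workspaces or workspace proxies for a group of users, one simple planning aid is to compare measured round-trip latencies per region. The sketch below picks the region with the lowest median latency. The region names and sample values are made up, and how the latencies are collected is left as an assumption.

```python
from statistics import median


def closest_region(latency_samples_ms):
    """Return the region with the lowest median round-trip latency.

    `latency_samples_ms` maps a region name to a non-empty list of
    latency measurements in milliseconds (hypothetical input shape).
    """
    if not latency_samples_ms:
        raise ValueError("no latency samples provided")
    return min(
        latency_samples_ms,
        key=lambda region: median(latency_samples_ms[region]),
    )


# Made-up measurements taken from one office location.
samples = {
    "us-east": [82.0, 79.5, 85.1],
    "eu-west": [18.2, 21.0, 19.4],
    "ap-south": [140.3, 138.9, 151.0],
}

print(closest_region(samples))
```

The median is used rather than the mean so that a single slow measurement does not skew the comparison.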

c. Run workspaces near your data and services - Where possible, deploy workspaces near your code repositories and data sources, to take advantage of lower latencies when cloning project repositories and downloading large data sets. Coder enables you to deploy workspaces in both your on-premises and public cloud infrastructure.

d. Run your database in the same region as the Coder server - Optimize database latency by keeping the Coder server close to the database, ideally in the same region. This keeps the dashboard responsive when users navigate it and interact with workspaces and templates. For the best experience, keep latency between the Coder server and the database under 10 ms.

(5.) Cost Optimization

a. Automate workspace shutoff - Prevent workspaces from running in perpetuity and running up your cloud bill. Enable auto stop to ensure workspaces are shut down after a specified period of inactivity. Enforce nightly/weekly shutdowns to avoid continuous use of outdated workspaces.
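The inactivity rule behind auto stop can be sketched as a simple time comparison. This is only an illustration of the idea, not Coder's implementation; the function name, the workspace names, and the input mapping are hypothetical.

```python
from datetime import datetime, timedelta


def workspaces_to_stop(last_activity_by_workspace, now, inactivity_limit):
    """Return the names of workspaces idle for at least `inactivity_limit`.

    `last_activity_by_workspace` maps a workspace name to the datetime of
    its most recent activity (hypothetical input shape).
    """
    return sorted(
        name
        for name, last_activity in last_activity_by_workspace.items()
        if now - last_activity >= inactivity_limit
    )


# Made-up example: a two-hour inactivity limit evaluated at 18:00.
now = datetime(2024, 3, 11, 18, 0)
last_activity = {
    "alice-api": datetime(2024, 3, 11, 17, 30),
    "bob-ml": datetime(2024, 3, 11, 12, 0),
}

print(workspaces_to_stop(last_activity, now, timedelta(hours=2)))
```

In this example only the workspace that has been idle since noon qualifies; the one used 30 minutes ago keeps running.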

b. Share compute across workspaces - When running Kubernetes-based workspaces, use the resource requests/limits to bin pack multiple workspaces on a single node to share compute and optimize cost. Software development workloads tend to have short bursts of resource utilization during compile and test, with longer periods of lower utilization while writing code. As such, the impact of sharing resources is negligible from a developer experience perspective.
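To get a feel for how resource requests translate into node counts, the sketch below packs workspace CPU requests onto fixed-size nodes with a first-fit decreasing heuristic. It is a capacity-planning estimate under assumed numbers, not a model of how the Kubernetes scheduler places pods; the workspace names and millicore values are made up.

```python
def pack_workspaces(cpu_requests_millicores, node_capacity_millicores):
    """Pack workspaces onto nodes using first-fit decreasing.

    Returns a list of nodes, each a list of workspace names.
    """
    nodes = []  # each entry is [remaining_millicores, [workspace names]]
    # Place the largest requests first; ties keep their original order.
    ordered = sorted(
        cpu_requests_millicores.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for name, request in ordered:
        if request > node_capacity_millicores:
            raise ValueError(f"{name} requests more CPU than a node provides")
        for node in nodes:
            if node[0] >= request:
                node[0] -= request
                node[1].append(name)
                break
        else:
            # No existing node has room, so start a new one.
            nodes.append([node_capacity_millicores - request, [name]])
    return [names for _, names in nodes]


requests = {"alice": 2000, "bob": 1000, "carol": 3000, "dave": 2000}
for index, names in enumerate(pack_workspaces(requests, 4000)):
    print(f"node {index}: {names}")
```

With these numbers, four workspaces requesting 8 CPU cores in total fit on two 4-core nodes. Because requests are usually set below peak usage, actual density depends on how bursty the workloads are.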

c. Allocate resource quotas - Set quota allowances for your developer groups to control how many workspaces of a given template each user can create. Determine how much a particular resource costs, and define such costs on the template to count against the allocated quota.
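The quota idea can be expressed as simple arithmetic: each workspace carries the cost defined on its template, and a new workspace is allowed only if the running total stays within the allowance. The sketch below is a simplified model under that assumption and does not reproduce Coder's actual quota accounting; the names and numbers are hypothetical.

```python
def can_create_workspace(template_cost, existing_workspace_costs, quota_allowance):
    """Return True if one more workspace fits within the quota allowance.

    All values are in the same abstract cost units (hypothetical).
    """
    used = sum(existing_workspace_costs)
    return used + template_cost <= quota_allowance


# Made-up example: an allowance of 10 units with two workspaces costing 4 each.
print(can_create_workspace(2, [4, 4], 10))
print(can_create_workspace(3, [4, 4], 10))
```

Assigning higher costs to templates that provision expensive resources, such as GPU instances, lets one allowance cover a mix of small and large workspaces.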

d. Tag and label workspaces - Label and tag your workspace resources in the templates, so they can be filtered for costs in your cloud console. For example, you may want to label workspaces by business unit or use-case for chargeback purposes. This will give you greater insight into the efficiency of your workspaces, and you can leverage cloud-native billing solutions that are currently available on the market.
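Once resources carry labels, chargeback reduces to grouping costs by a label value. The sketch below totals monthly cost per business unit. The record shape, the `business_unit` label key, and the cost figures are assumptions for illustration, not the output format of any particular cloud billing tool.

```python
from collections import defaultdict


def cost_by_label(resources, label_key):
    """Sum monthly cost per value of `label_key`.

    Resources that lack the label are grouped under "unlabeled" so that
    gaps in tagging stay visible.
    """
    totals = defaultdict(float)
    for resource in resources:
        group = resource["labels"].get(label_key, "unlabeled")
        totals[group] += resource["monthly_cost"]
    return dict(totals)


# Made-up billing records.
resources = [
    {"labels": {"business_unit": "payments"}, "monthly_cost": 120.0},
    {"labels": {"business_unit": "payments"}, "monthly_cost": 80.0},
    {"labels": {"business_unit": "search"}, "monthly_cost": 45.5},
    {"labels": {}, "monthly_cost": 10.0},
]

for group, total in cost_by_label(resources, "business_unit").items():
    print(group, total)
```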

Conclusion

By leveraging Coder’s Well-Architected Framework, enterprises will be able to provide their developer community with a highly secure and reliable cloud development environment platform. In addition, applying these best practices will help you achieve your business objectives for improved developer productivity and information security.
