Rearchitecting Coder’s networking with WebRTC
WebRTC (Web Real-Time Communication) enables delivery of audio and video conferencing applications using native web technologies, but a lesser-known feature is that it can tunnel arbitrary data. This is the story of how and why we migrated our networking from a traditional reverse proxy architecture over to WebRTC, and what we learned in the process.
Where we started
Coder orchestrates development environments on your existing Kubernetes infrastructure, whether self-hosted or in the cloud. As with many cloud-native applications, Coder relied on a reverse proxy, which we called the “envproxy.” The envproxy routed traffic from outside the cluster into running workspaces. Coder is made up of a control plane (the service we call “coderd,” which provides a dashboard for users to start and stop their workspaces) and a data plane (the envproxy and the workspace).
In this context, the data plane refers to components that are in the critical path of the developer’s workflow; any failure would cause a disruption to a user’s development process. The proxy architecture provided a central point to enforce access control and perform audit logging, but it had consequences that led to a poor user experience:
- If the envproxy process crashes or restarts due to an upgrade, clients experience a brief interruption and programs need to re-establish any in-flight connections.
- The proxy imposes additional round-trip latency, since it must read incoming stream data, decrypt it, and then re-encrypt it before forwarding it to the target workspace. As with audio or video calls on an unreliable connection, any noticeable latency when editing source code interactively is frustrating.
- The proxy operates at layer 7, constraining the protocols that developers can use to those that the proxy understands: for example, developers cannot access PostgreSQL databases in their workspace using its native binary protocol. The alternative is to use SSH port forwarding to tunnel these connections.
- The proxy is an additional service to maintain, and since it is shared by multiple users of the system, any issue with it can affect many users simultaneously.
- Services running inside the workspace containers need to listen for traffic from the proxy. While we use Network Policies to secure these, cluster administrators must ensure that the cluster networking enforces these policies. For example, Amazon’s Elastic Kubernetes Service (EKS) does not enforce these policies by default, and we want to take a defense-in-depth approach to enforce security boundaries.
- Multiple points of failure mean that problems can be difficult to diagnose, and sometimes a failure is not apparent to users at all.
What we evaluated
Many of our engineers use SSH forwarding to tunnel PostgreSQL traffic or pass through an SSH agent socket from their local workstations to their Coder workspace. While this approach generally works well, the envproxy introduces a potential failure mode and inhibits upgrades, because all in-flight connections must complete before it can shut down gracefully. A peer-to-peer architecture would resolve these challenges, while also providing a secure-by-default installation and reducing effort for system operators.
WireGuard solves the same fundamental challenge that we face with developer workspaces: securely providing end-to-end connectivity, across untrusted and unreliable networks, supporting a mesh operation mode and network roaming. Unfortunately, browsers do not include native support for WireGuard, and implementing it would be non-trivial, since browsers do not allow transmission of arbitrary UDP or even TCP traffic.
Our friends at Discord use the WebRTC protocol to stream media in real time, and we realized that Coder has similar requirements in terms of latency, security, browser and device support, and compatibility with diverse networking configurations. If we could tunnel arbitrary protocols over WebRTC, then we could leverage the existing ecosystem. Since browsers provide built-in WebRTC APIs, we would be able to modify our open source code-server project to use it as an underlying transport, providing an even faster editing experience. Better still, we could provide end-to-end encryption and minimize latency on local networks, even behind gateways using Network Address Translation (NAT), through WebRTC-compatible technologies including STUN, TURN relays, and DTLS.
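To make the NAT traversal piece more concrete, here is a minimal sketch of the STUN exchange a peer uses to learn the public address its NAT gateway assigned to it. It follows the message layout from the STUN specification (RFC 5389), but the function names are our own for illustration, the response is simulated locally rather than fetched from a real server, and only IPv4 is handled.

```python
import ipaddress
import os
import struct
from typing import Tuple

MAGIC_COOKIE = 0x2112A442      # fixed constant defined by the STUN spec
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes) -> bytes:
    """Encode a Binding request with no attributes (a bare 20-byte header)."""
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def encode_xor_mapped_ipv4(ip: str, port: int) -> bytes:
    """Encode the XOR-MAPPED-ADDRESS attribute the way a server would."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = int(ipaddress.IPv4Address(ip)) ^ MAGIC_COOKIE
    value = struct.pack("!BBHI", 0, 0x01, x_port, x_addr)  # 0x01 = IPv4 family
    return struct.pack("!HH", XOR_MAPPED_ADDRESS, len(value)) + value


def parse_binding_success(message: bytes, transaction_id: bytes) -> Tuple[str, int]:
    """Return the (ip, port) that the server observed for this client."""
    msg_type, length, cookie = struct.unpack("!HHI", message[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN Binding success response")
    if message[8:20] != transaction_id:
        raise ValueError("transaction ID mismatch")
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and value[1] == 0x01:
            x_port, x_addr = struct.unpack("!HI", value[2:8])
            ip = str(ipaddress.IPv4Address(x_addr ^ MAGIC_COOKIE))
            return ip, x_port ^ (MAGIC_COOKIE >> 16)
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are 32-bit aligned
    raise ValueError("no IPv4 XOR-MAPPED-ADDRESS attribute found")


if __name__ == "__main__":
    tid = os.urandom(12)
    request = build_binding_request(tid)          # what the client would send
    attr = encode_xor_mapped_ipv4("203.0.113.7", 54321)
    response = struct.pack("!HHI", BINDING_SUCCESS, len(attr), MAGIC_COOKIE) + tid + attr
    print(parse_binding_success(response, tid))   # ('203.0.113.7', 54321)
```

The address is XOR-ed with the magic cookie so that NAT devices which rewrite IP addresses found in packet payloads do not corrupt it. A real client would send the request over UDP to a STUN server and offer the returned address to its peer as a connection candidate.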
Why and how we use WebRTC
With our new networking model, Coder uses WebRTC to broker connections between the user and services running inside their workspace. To maximize compatibility and provide a clean migration path, our initial implementation uses the open source Pion TURN relay server. The relay provides a rendezvous point that both the user and the workspace they are trying to access can reach. It may be located either inside or outside of the Kubernetes cluster, as long as it is able to receive inbound connections from both.
An agent running inside the workspace establishes an outbound connection to the relay service, and the user connects to the same relay with a token to authorize the connection to the endpoint. Both the user and the workspace must be able to connect to the relay, but neither the workspace container nor the user’s workstation needs to connect directly to the other. This approach means that:
- The workspace agent process can establish an outbound connection, rather than creating a listening port to accept inbound connections.
- The relay operates at the transport layer (the encrypted TCP stream between the workspace and relay service) rather than the application layer (SSH, HTTP, HTTPS), so it is protocol-independent and capable of tunneling arbitrary protocols between the user and their workspace.
- Only the relay service and control plane need to be accessible by clients, so we minimize the points of entry that administrators need to secure.
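The rendezvous pattern described above can be sketched in a few lines. This is a toy illustration only, not the TURN protocol and not Coder's or Pion's code: the `AGENT <token>` / `USER <token>` handshake line, the `Relay` class, and the uppercase "service" are all invented for the example. What it does show is the shape of the design: the workspace agent dials out, the user presents a matching token, and the relay copies bytes without ever parsing them.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes one way until EOF. The payload is never parsed."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


class Relay:
    def __init__(self):
        self.parked = {}  # token -> (reader, writer) of a waiting agent

    async def handle(self, reader, writer):
        hello = (await reader.readline()).decode().strip()
        role, _, token = hello.partition(" ")
        if role == "AGENT":
            self.parked[token] = (reader, writer)  # park until a user arrives
            writer.write(b"OK\n")
            await writer.drain()
        elif role == "USER" and token in self.parked:
            agent_reader, agent_writer = self.parked.pop(token)
            # Splice the two connections together in both directions.
            await asyncio.gather(pipe(reader, agent_writer),
                                 pipe(agent_reader, writer))
        else:
            writer.close()  # unknown role or token: refuse the connection


async def serve_upper(reader, writer):
    """Stand-in for a workspace service: echoes input in upper case."""
    while data := await reader.read(1024):
        writer.write(data.upper())
        await writer.drain()
    writer.close()


async def demo():
    relay = Relay()
    server = await asyncio.start_server(relay.handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Workspace side: an outbound connection only, no listening port.
    a_reader, a_writer = await asyncio.open_connection("127.0.0.1", port)
    a_writer.write(b"AGENT secret\n")
    await a_writer.drain()
    await a_reader.readline()  # wait for the relay's "OK"
    service = asyncio.create_task(serve_upper(a_reader, a_writer))

    # User side: present the same token, then speak any protocol.
    u_reader, u_writer = await asyncio.open_connection("127.0.0.1", port)
    u_writer.write(b"USER secret\nselect 1;")
    await u_writer.drain()
    reply = await u_reader.readexactly(9)
    u_writer.close()
    await service  # ends once the relay propagates the user's EOF
    server.close()
    return reply


if __name__ == "__main__":
    print(asyncio.run(demo()))  # b'SELECT 1;'
```

Because `pipe` treats the stream as opaque bytes, the same relay would carry SSH, HTTP, or a database wire protocol unchanged, which is the property that an application-layer proxy lacks.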
While we believe that our new approach to networking will yield significant improvements to the experience of installing and operating a Coder deployment, we’re just getting started. In future releases, we want to explore:
- Now in Coder 1.22 (Aug 2021): Full peer-to-peer communication using STUN to reduce bandwidth demands on the TURN relay server and eliminate it as a single point of failure
- WebRTC support for code-server, so that connections go directly between the user’s browser and development workspace, using STUN and TURN for NAT traversal
- An in-browser WebRTC proxy using a web worker to transparently tunnel arbitrary HTTP/HTTPS network traffic
- Testing of high-availability configurations to ensure that the networking remains reliable despite a variety of adverse network conditions
- Using Twilio’s Global Network Traversal Service, which hosts a global network of STUN and TURN relays, instead of the built-in service
As a fallback when direct connectivity is not possible due to network conditions, traffic will flow through the TURN relay as depicted on the left. In future releases, users will be able to connect directly to their workspaces, relying on the STUN service to broker the connection and traverse NAT gateways. As a result, the relay will not be part of the data plane and interruptions to the STUN/TURN relay process will not affect connectivity to the workspace.
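The preference for a direct path over the relay falls out of how ICE, the negotiation layer used by WebRTC, ranks connection candidates. The sketch below uses the priority formula and recommended type preferences from the ICE specification (RFC 8445). The `Candidate` class, the `pick_route` helper, and the example addresses are hypothetical, and real ICE ranks candidate pairs rather than single candidates, so treat this as a simplification.

```python
from dataclasses import dataclass

# Recommended type preferences from RFC 8445: direct paths outrank relayed ones.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    kind: str                     # "host", "srflx" (learned via STUN), or "relay" (TURN)
    address: str
    local_preference: int = 65535
    component_id: int = 1

    @property
    def priority(self) -> int:
        # priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)
        return ((TYPE_PREFERENCE[self.kind] << 24)
                + (self.local_preference << 8)
                + (256 - self.component_id))


def pick_route(candidates, reachable):
    """Return the highest-priority candidate that passed a connectivity check."""
    usable = [c for c in candidates if c.address in reachable]
    return max(usable, key=lambda c: c.priority, default=None)


if __name__ == "__main__":
    candidates = [
        Candidate("host", "10.0.0.5:4000"),
        Candidate("srflx", "203.0.113.7:54321"),
        Candidate("relay", "198.51.100.9:3478"),
    ]
    # Direct path works: the STUN-discovered address wins over the relay.
    print(pick_route(candidates, {"203.0.113.7:54321", "198.51.100.9:3478"}).kind)  # srflx
    # Restrictive network: only the TURN relay is reachable, so it is the fallback.
    print(pick_route(candidates, {"198.51.100.9:3478"}).kind)  # relay
```

With a type preference of 0, a relayed candidate is selected only when no direct candidate passes its connectivity check, which matches the fallback behavior described above.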