WebRTC (Web Real-Time Communication) enables delivery of audio and video conferencing applications using native web technologies, but a lesser-known feature is that it can tunnel arbitrary data. This is the story of how and why we migrated our networking from a traditional reverse proxy architecture over to WebRTC, and what we learned in the process.
Coder orchestrates development environments on your existing Kubernetes infrastructure, whether self-hosted or in the cloud. As with many cloud-native applications, Coder relied on a reverse proxy, which we called the “envproxy.” The envproxy routed traffic from outside the cluster into running workspaces. Coder is made up of a control plane (the service we call “coderd,” which provides a dashboard for users to start and stop their workspaces) and a data plane (the envproxy and workspace).
In this context, the data plane refers to components that are in the critical path of the developer’s workflow; any failure would cause a disruption to a user’s development process. The proxy architecture provided a central point to enforce access control and perform audit logging, but it had consequences that led to a poor user experience:
Many of our engineers use SSH forwarding to tunnel PostgreSQL traffic or pass through an SSH agent socket from their local workstations to their Coder workspace. While this approach generally works well, the envproxy introduces a potential failure mode and inhibits upgrades, because all in-flight connections must complete before the proxy can shut down gracefully. A peer-to-peer architecture would resolve these challenges, while also providing a secure-by-default installation and reducing effort for system operators.
WireGuard solves the same fundamental challenge that we face with developer workspaces: securely providing end-to-end connectivity, across untrusted and unreliable networks, supporting a mesh operation mode and network roaming. Unfortunately, browsers do not include native support for WireGuard, and implementing it would be non-trivial, since browsers do not allow transmission of arbitrary UDP or even TCP traffic.
Our friends at Discord use the WebRTC protocol to stream media in real time, and we realized that Coder has similar requirements in terms of latency, security, browser and device support, and compatibility with diverse networking configurations. If we could tunnel arbitrary protocols over WebRTC, then we could leverage the existing ecosystem. Since browsers provide built-in WebRTC APIs, we would be able to modify our open source code-server project to use it as an underlying transport, providing an even faster editing experience. Better still, we could provide end-to-end encryption and minimize latency on local networks, even behind gateways using Network Address Translation (NAT), through WebRTC-compatible technologies including STUN, TURN relays, and DTLS.
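One practical detail of tunneling a stream-oriented protocol (such as SSH or PostgreSQL traffic) over WebRTC is that data channels carry discrete messages rather than a continuous byte stream. The sketch below is illustrative only and is not taken from Coder's implementation: the function names are hypothetical, and the 16 KiB limit is an assumption based on the conservative message size commonly recommended for interoperability between WebRTC implementations. It also assumes the channel is ordered and reliable, which is the default for data channels.

```python
from typing import Iterable, Iterator

# Assumption: 16 KiB is a conservative per-message size that is commonly
# recommended for compatibility across WebRTC implementations.
MAX_MESSAGE_SIZE = 16 * 1024


def split_into_messages(stream: bytes, max_size: int = MAX_MESSAGE_SIZE) -> Iterator[bytes]:
    """Split a byte stream into data-channel-sized messages, preserving order."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    for offset in range(0, len(stream), max_size):
        yield stream[offset:offset + max_size]


def reassemble(messages: Iterable[bytes]) -> bytes:
    """Rebuild the original stream.

    Assumes messages arrive complete and in order, as they would on an
    ordered, reliable data channel.
    """
    return b"".join(messages)


if __name__ == "__main__":
    payload = b"x" * 40_000  # stand-in for bytes read from a local TCP socket
    messages = list(split_into_messages(payload))
    print([len(m) for m in messages])
    print(reassemble(messages) == payload)
```

In a real tunnel, each message would be handed to the data channel's send call on one side and appended to the destination socket on the other; flow control and connection teardown are omitted here.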
With our new networking model, Coder uses WebRTC to broker connections between the user and services running inside their workspace. In order to maximize compatibility and provide a clean migration path, our initial implementation uses the open source Pion TURN relay server. The relay provides a rendezvous point accessible to both the user and the workspace they are trying to access. It may be located either inside or outside of the Kubernetes cluster, as long as it is able to receive inbound connections from both:
An agent running inside the workspace establishes an outbound connection to the relay service, and the user connects to the same relay with a token to authorize the connection to the endpoint. Both the user and the workspace must be able to connect to the relay, but neither the workspace container nor the user’s workstation needs to connect directly to the other. This approach means that:
While we believe that our new approach to networking will yield significant improvements to the experience of installing and operating a Coder deployment, we’re just getting started. In future releases, we want to explore:
As a fallback when direct connectivity is not possible due to network conditions, traffic will flow through the TURN relay as depicted on the left. In future releases, users will be able to connect directly to their workspaces, relying on the STUN service to broker the connection and traverse NAT gateways. As a result, the relay will not be part of the data plane and interruptions to the STUN/TURN relay process will not affect connectivity to the workspace.
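The "direct first, relay as fallback" behavior comes from ICE, the negotiation mechanism WebRTC uses to choose a network path. Each side gathers candidate addresses (host, server-reflexive via STUN, relayed via TURN) and assigns each a priority. The sketch below computes that priority using the formula and recommended type preferences from RFC 8445; the function names are hypothetical, and real ICE agents go on to rank candidate pairs and run connectivity checks, which this example does not model.

```python
# Type preferences recommended by RFC 8445 (higher means more preferred).
TYPE_PREFERENCE = {
    "host": 126,   # address on a local interface
    "prflx": 110,  # peer-reflexive, discovered during connectivity checks
    "srflx": 100,  # server-reflexive, discovered via STUN
    "relay": 0,    # relayed through a TURN server
}


def candidate_priority(candidate_type: str,
                       local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """Compute an ICE candidate priority per the RFC 8445 formula:

    priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)
    """
    type_pref = TYPE_PREFERENCE[candidate_type]
    return (type_pref << 24) + (local_preference << 8) + (256 - component_id)


def preferred_order(candidate_types: list[str]) -> list[str]:
    """Sort candidate types from most to least preferred."""
    return sorted(candidate_types, key=candidate_priority, reverse=True)


if __name__ == "__main__":
    for kind in preferred_order(["relay", "srflx", "host"]):
        print(kind, candidate_priority(kind))
```

Because relayed candidates receive the lowest type preference, a TURN path is normally chosen only when no host or server-reflexive path passes connectivity checks.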
We’ll be talking about our migration to WebRTC on this week’s edition of Coffee and Coder. Join us on Twitch to learn more and to ask any questions you have for us.