If you've ever heard an engineer lamenting (or sometimes arrogantly proclaiming) "well, it works on my machine," then you have witnessed configuration drift. Something changed (besides the codebase), either on the developer's machine or on the server where the code was just deployed. It could be that the developer was using a mismatched version of a library, or a recent security change on the server could be the culprit. Whatever changed, it was just enough to break the new code.
Hopefully, the problem was discovered in testing and not in production. Even then, it may not be easy to spot: debugging could take hours, maybe even days, of tedious work. Meanwhile, leadership grows anxious about yet another delay and about the reports tracking the team's productivity KPIs.
Configuration drift is any unplanned variance between a system's expected or recorded state and its actual state.
In some cases, drift may have no impact on the daily performance of a system. In other cases, it can cause instability or even total service outages. While the latter scenarios are the ones that DevOps teams usually dread, as I'll show later, drift that goes unnoticed can sometimes wreak the most havoc on an organization.
As software production and delivery systems grow more complex, the opportunities for drift increase. Code moves from a developer's workstation to a shared dev environment, then to test and QA environments, and finally to staging and production environments. The further along in the pipeline drift occurs, the greater the potential impact. Typically, only staging and production are intended to be exact replicas of each other, but even a difference between the version of a package installed on a developer's laptop and the version installed on a test server can cause delays as problems are debugged. With many companies now deploying new code multiple times per day, the pressure is intense.
A 2019 report by DevOps Research and Assessment (DORA) and Google Cloud revealed how often deployments fail, i.e. "resulted in degraded service (e.g., lead to service impairment or service outage) and subsequently require remediation (e.g., require a hotfix, rollback, fix forward, patch)." Some companies reported that up to 60% of their deployments failed.
Configuration drift is often the root cause of deployment failures.
Each step of the CI/CD pipeline presents multiple opportunities for drift to occur. There are, however, a few common culprits.
I once worked on the data warehouse team at a national grocery chain. With data entering the warehouse from dozens of different source systems, it was not uncommon for a structural change in the data from an upstream system to cause problems in the downstream reporting. The changes were often intentional, but the upstream teams had failed to communicate with their downstream partners about the impending change, and the downstream systems broke as a result.
Takeaway: In highly complex, integrated systems, even carefully planned changes become drift if all impacted parties are not informed or included in the planning.
Hotfixes are code changes made to address an immediate problem that cannot wait for the next planned release. To understand how they can result in configuration drift, let's return to the data warehouse scenario where the upstream changes are now causing problems downstream. VPs are hitting refresh on their Tableau reports and growing impatient. Emails and chat messages are flying.
In such situations, warehouse data engineers would often make hotfixes on the production server, modifying the DDL or reporting queries in real time to make the reports behave as expected. After solving the problem, the weary engineers sometimes failed to document the fix and apply it to the other environments in the warehouse's CI/CD pipeline, sending the production server into a state of drift. Often, the drift would be discovered only after a subsequent code update was pushed from staging to production, overwriting the hotfix and reintroducing the original problem.
Takeaway: Hotfixes are often performed under intense pressure, and by its very nature, the work falls outside of standard procedures, creating a perfect storm for configuration drift.
Critical security updates are close cousins with hotfixes. Both are often performed at breakneck speeds without adherence to normal procedures. The primary difference is that hotfixes are applied to fix an immediate problem, while critical package updates are applied in hopes of avoiding a future incident.
Takeaway: It bears repeating that work performed under intense pressure that doesn't follow standard procedures is a breeding ground for configuration drift.
Modern CI/CD pipelines typically include built-in checks for errors and tests. The move to the cloud and infrastructure as code paradigms allow us to replicate systems and scale out existing systems with ease. Infrastructure as code inherently provides a record of current and previous configuration, eliminating the problem of failing to document planned changes. As we will see, configuration management tools can spot and sometimes stop configuration drift before it becomes a problem.
This is the vision.
However, in most cases, reality differs from the ideal. Many organizations have a hybrid architecture consisting of mutable, automatable cloud resources alongside legacy systems that are immutable and continue to function thanks only to the esoteric knowledge of a select few sysadmin wizards, whose work cannot be automated because it has never been documented in the first place. Or the applications are running on antiquated hardware and can't be ported to the cloud.
Takeaway: You can't automate away configuration drift, but without automation you'll have a lot more of it.
Many changes are meant to be temporary. A sysadmin grants a developer elevated permissions to troubleshoot a problem. Someone installs a new package on a test server to try out its new functionality. Bucket permissions are changed in order to share a file quickly with a teammate. When these changes are not reverted to their original state, drift occurs.
Takeaway: Not all change is drift, but once it is forgotten it may become drift.
The most commonly listed consequences of configuration drift are lost productivity and downtime as engineers troubleshoot code and environments, trying to identify the cause of the unexpected behavior.
Unplanned downtime is undeniably costly, though how costly is a matter of debate. A commonly cited Gartner report estimates that, on average, each minute of unexpected downtime costs a company $5,600. Published in 2014, the report is now dated and has spawned a small industry of posts offering updated and refined calculations.
More recently, though, Gartner senior analyst David Gregory suggested that trying to pin a dollar figure on downtime misses the broader picture.
"I&O [Infrastructure and operations] leaders often try to leverage information from various sources about how much an outage will cost per hour. Instead, focus on impacts to stakeholders of the business — the end users or the individual business operations leaders — and how those outcomes are directly impacted by the loss of IT services."
The same argument applies to understanding the costs of configuration drift. We need to move beyond thinking that the primary impact is limited to the DevOps team and think more about the implications for a broad range of stakeholders.
In a very detailed (and commendable) incident report posted to their blog, Twilio identified the root cause of the intrusion: a change made to the bucket's configuration while troubleshooting an earlier problem. After applying that hotfix, the engineer working the problem failed to roll the bucket's configuration back to its original secure settings, and configuration drift ensued. Remarkably, the misconfiguration went undetected for over four years before the breach occurred.
While Twilio identified the root cause of the breach and remediated it fairly quickly with limited impact, other companies haven't been as fortunate. Hackers applying the same techniques used in the Twilio incident also breached British Airways systems and stole the personal information and credit card numbers of almost a half-million customers.
As a result of the lapse, the company was fined £183.39 million ($229.2 million) under GDPR, the largest fine ever assessed under the regulation at the time. The stock price of British Airways' parent company tumbled in the weeks and months following disclosure of the breach. British Airways now faces the largest class-action lawsuit in British history, filed by impacted customers, with a potential payout estimated to be in the billions.
Clearly, the consequences of configuration drift go far beyond frustrated engineers, delayed deployments, and unplanned downtime. The security threats and potential damage to both companies and their customers are immense and not to be taken lightly.
Configuration drift is, unfortunately, probably inevitable in the modern CI/CD pipeline. With the push for continuous delivery, hotfixes and critical package updates will continue to be deployed inconsistently. And then there is always good old human error.
Eliminating configuration drift is a fool's errand.
Instead, you should approach drift through the paradigms used to manage risk: reduce the probability that the assumed risks actually materialize, and improve the company's ability to manage or contain risk events should they occur. To manage configuration drift, focus on a three-part strategy: reduce the frequency of drift, detect it when it occurs, and correct it quickly.
While it may be impossible to prevent configuration drift, you can certainly reduce its frequency.
As outlined above in the causes of configuration drift, human error is often the true culprit, whether by failing to follow established change management procedures or by failing to communicate and plan effectively. Of course, no one can follow rules that do not exist. If your organization does not have them, then clearly defined change management policies and procedures must be your first priority.
The next step is to automate as much as possible. The DORA and Google Cloud report cited earlier found that the organizations with the lowest rates of failed deployments also had the highest rates of automation. A single CloudFormation template, for example, can launch dozens or hundreds of servers on AWS, all configured exactly the same and therefore able to be counted on to run your code the same way in production.
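As a rough sketch, a CloudFormation template along these lines (the AMI ID and sizing below are hypothetical) launches a fleet whose members are identical by construction, because every instance is created from the same launch template:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Illustrative sketch of an identically configured server fleet
Resources:
  WebLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: ami-0123456789abcdef0   # hypothetical AMI baked with the app's exact dependencies
        InstanceType: t3.medium
  WebFleet:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "4"
      MaxSize: "4"
      AvailabilityZones: !GetAZs ""
      LaunchTemplate:
        LaunchTemplateId: !Ref WebLaunchTemplate
        Version: !GetAtt WebLaunchTemplate.LatestVersionNumber
```

Just as importantly, the template itself lives in version control, so the fleet's intended state is documented by default.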
Organizations should also look to automate the beginning of the pipeline — the creation and management of developers' workspaces. Traditionally, developers have configured their own workstations with the tools and libraries necessary for a project. As a result, these machines are particularly prone to drift and misconfiguration. Changes to staging and production environments are inconsistently communicated to the developers at the opposite end of the pipeline.
Developer workspaces need to match the configuration of the production environments where their code will eventually run; otherwise, problems will inevitably arise. With dozens or hundreds of developers working on a particular project, the traditional method of having each developer configure their own workspace is simply not sustainable.
A better solution is to automate the creation of developer workspaces from pre-configured images that match the configuration of the higher environments and contain all the tools and libraries necessary for the project. This approach brings the infrastructure-as-code paradigm to the beginning of the pipeline.
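For example, a workspace image for a hypothetical Python service might be defined like this, using the same base image and the same dependency lockfile used to build production:

```dockerfile
# Hypothetical workspace image: same base and pinned dependencies as production.
FROM python:3.10-slim

# Tooling every developer on the project needs, installed in one reproducible layer.
RUN apt-get update && apt-get install -y --no-install-recommends \
        git curl \
    && rm -rf /var/lib/apt/lists/*

# requirements.txt is the same lockfile used to build the production image,
# so a library version that works locally is the version that runs in production.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```

When production dependencies change, the image is rebuilt and redistributed, rather than each developer updating their machine by hand.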
Probably the most common method of detecting when configuration drift has occurred, unfortunately, is breaking something. Clearly, not ideal.
A proactive, systematic approach to detecting configuration drift requires two things: documentation of a system's intended state and regular auditing of its current state.
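The comparison at the heart of any drift audit can be sketched in a few lines of Python; the package names and recorded states below are invented for illustration:

```python
def detect_drift(intended: dict, actual: dict) -> dict:
    """Return the keys whose observed value differs from the recorded intended one."""
    drift = {}
    for key in sorted(intended.keys() | actual.keys()):
        expected, observed = intended.get(key), actual.get(key)
        if expected != observed:
            drift[key] = {"intended": expected, "actual": observed}
    return drift

# Hypothetical recorded baseline vs. a live audit of the same server.
intended = {"nginx": "1.24.0", "openssl": "3.0.13", "s3_bucket_public": False}
actual = {"nginx": "1.24.0", "openssl": "3.1.0", "s3_bucket_public": True}

print(detect_drift(intended, actual))
# {'openssl': {'intended': '3.0.13', 'actual': '3.1.0'},
#  's3_bucket_public': {'intended': False, 'actual': True}}
```

Real tools perform essentially this comparison at scale, with the intended state drawn from recorded configurations rather than a hand-written dictionary.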
Configuration management tools and services such as Chef, Puppet, Ansible, AWS Config, and Google's Cloud Asset Inventory automate the process. These tools continuously monitor your resources, record any configuration changes, and send alerts as needed.
Organizations should also conduct regular manual audits, though. Configuration management tools can only detect changes in systems they have been configured to monitor, and in today's complex multi-cloud and hybrid environments, knowing the totality of your assets can be a challenge in itself.
Once detected, drift must be corrected. In the case of the Twilio breach, remediation required only changing the permissions on the bucket and replacing the corrupted file with a backup from before the breach. Detailed logs made it possible to identify when the breach occurred, and automated backups ensured there was a clean version of the file to restore.
Deployments can be automated and configured to roll back changes at the first signs of failure, reducing or even eliminating downtime, though it may take some human detective work to figure out why the problem occurred.
Many configuration management tools and services take a proactive approach to remediating drift. After detecting a change, they can automatically restore the system to its last documented intended state. Be careful, though, before enabling such automation: if the drift resulted from an undocumented hotfix or critical package update, you could be restoring the system to an unstable or insecure state.
Configuration drift is the root of many failures in CI/CD pipelines. Far from just an annoyance for developers, drift can have a major impact on services, customers, and organizations as a whole and should be treated as a significant source of risk.
As with any other risk, drift cannot be eliminated entirely. Instead, it must be managed to reduce its probability and contain its impact when it does occur.
Automation is one key component to reducing the occurrence of drift as well as detecting and correcting it.
At Coder, we help your organization manage the risk of configuration drift by moving developer workspaces to the cloud and automating their creation and orchestration. Our product lets you define developer workspaces as code and spin up new environments quickly and securely. Developers access their workspaces remotely through a browser, with no need to spend hours or days configuring a workstation or VDI. They can literally create a new workspace in less than five minutes and start writing code almost immediately.
Coder encourages the practice of rolling out a new image with new dependencies instead of depending on an engineering manager telling everyone to "make sure to upgrade to Python 3.5.3." When the production environment changes, all that is required is an update to the workspace Docker image. These changes can then be pushed out to all workspaces created from that image — configuration drift averted for the entire team throughout the entire pipeline.
To learn more about how Coder can help your organization reduce the risk of costly and dangerous configuration drift, schedule a demo today.