For those of us of a certain age who work in software, changing and releasing software has traditionally been a perilous activity. Customers had to be hand-held, release sequences carefully orchestrated, and contingencies planned in case of failure. And so the best releases were done by the best release engineers (remember that title?), typically the ones who were most averse to risk. Thankfully, we now know there’s a better way.
Software today is ideally deployed into cloud-based infrastructure with little or no customer involvement, with no risk of downtime, and, most critically, with complete automation. The risk of breakage is minimized as the duration of Agile iterations dwindles to zero, ensuring that when breakages occur, they can be mitigated in minutes. What was once a massively delicate and stress-filled activity can be automated into oblivion and reduced to an afterthought. This is the realm of continuous deployment and delivery, and it is outstanding.
As teams I’ve worked with over the years have taught me, there are some key considerations in ensuring that continuous delivery is achievable, while maintaining everyone’s sanity.
Careful Code Review
Changes made to the code have always needed peer review. But in an environment where a change is shipped to the customer as soon as it has the approval of peers and quality assurance, the importance of thorough code review increases dramatically. This is not to say that perfection is the goal (an all too common aim of naive, less experienced programmers), but rather that every change is evaluated for its quality and thoroughness relative to its potential for negative impact on the application.
Release in the Middle of the Day
If you are waiting until off-hours to release your code, it’s probably a sign that your deployment process is fragile and not correctly mitigating the risk of downtime. Fear that you will break the application at peak usage breeds a fear of any change at all, let alone a complex and innovative one. This, of course, has exceptions and is subject to the idiosyncrasies of your industry, product, and customer needs. But as a general rule, the best time to release software is when everyone is available to monitor the release and turn around any fixes.
Of course, there are ways to avoid the impact of breakages during peak usage altogether, such as rolling updates, blue/green deployments, canary releases, and even very new approaches like Houston.
This is a controversial subject, so flexibility in execution is necessary. Go with what the team is comfortable with, but also make decisions that build confidence in automated releases at any time of the day.
How Fast Can You Fix Your Mistakes
When continuously releasing software to production, there is an increased need to be alerted, automatically, when an issue has occurred. The human rigor typically applied to big-bang releases (manual canary testing, full regression testing, and so on) may not be possible given the frequency of releases. As such, when a breakage does occur, it’s critical to have a systematic approach for detecting the failure and redeploying a fix (or rolling back, as a last resort).
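As a minimal sketch of what such a systematic approach could look like, the Python below classifies a release from a sample of recent HTTP status codes and prefers redeploying a fix over rolling back. The names `error_rate` and `decide`, the 5% threshold, and the use of 5xx responses as the failure signal are all assumptions made for illustration; a real pipeline would take its signal from whatever monitoring it already has.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ReleaseDecision:
    action: str        # "keep", "redeploy_fix", or "roll_back"
    error_rate: float  # observed fraction of failed requests


def error_rate(status_codes: Sequence[int]) -> float:
    """Return the fraction of responses that are server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def decide(status_codes: Sequence[int], fix_available: bool,
           threshold: float = 0.05) -> ReleaseDecision:
    """Pick a response to a release based on its post-deploy error rate.

    The threshold is a hypothetical value, not a recommendation.
    """
    rate = error_rate(status_codes)
    if rate <= threshold:
        return ReleaseDecision("keep", rate)
    if fix_available:
        # Rolling forward with a fix is preferred when one is ready.
        return ReleaseDecision("redeploy_fix", rate)
    # Rolling back is the last resort.
    return ReleaseDecision("roll_back", rate)


if __name__ == "__main__":
    sample = [200] * 8 + [503] * 2
    print(decide(sample, fix_available=False))
```

The point of the sketch is the ordering of the branches: the decision is made from an automatic signal rather than a person watching a dashboard, and rollback is reached only when no fix is ready.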
A common model for managing security incidents applies to continuous delivery, as well. This graphic illustrates a few key KPIs that should be kept in mind.
While this graphic is intended to describe responding to security incidents, I like to imagine the same metrics apply when the incident in question is a planned release of your software that goes wrong. By managing the time to detect a problem independently from the time to recovery, a team ensures the most effective reaction times in the case of a failed release.
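To make the two measurements concrete, here is a small Python sketch that tracks them separately for a failed release. It assumes that time to detect runs from deployment to detection and that time to recover runs from detection to recovery; the class and field names are hypothetical, and your own incident model may draw the boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class FailedRelease:
    deployed_at: datetime   # when the bad change reached production
    detected_at: datetime   # when monitoring (or a person) noticed
    recovered_at: datetime  # when the fix or rollback took effect

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.deployed_at

    @property
    def time_to_recover(self) -> timedelta:
        # Measured from detection (an assumption), so the two
        # metrics can be improved independently of each other.
        return self.recovered_at - self.detected_at


def mean_duration(durations: Sequence[timedelta]) -> timedelta:
    """Average a non-empty sequence of durations."""
    if not durations:
        raise ValueError("need at least one duration")
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    release = FailedRelease(
        deployed_at=datetime(2024, 1, 1, 14, 0),
        detected_at=datetime(2024, 1, 1, 14, 4),
        recovered_at=datetime(2024, 1, 1, 14, 15),
    )
    print(release.time_to_detect, release.time_to_recover)
```

Keeping the two durations apart shows where to invest: a long time to detect points at alerting, while a long time to recover points at the deployment pipeline itself.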
Stop Penalizing Failure
It is not a matter of if, but when, a software release will go wrong. How the team reacts to this failure is a good indication of whether or not they are ready for continuous deployment. In my professional past, I was very negative and punitive in my feedback when projects were not successfully deployed. Healthier cultures and mentorship have shown me that a team that can calmly and quickly correct problems on their own is much more desirable.
“If you want something new, you have to stop doing something old”
― Peter F. Drucker
Infrastructure as Code == 🔥
When your entire application, infrastructure included, is described in code, it can be reasoned about as a whole, and so can the risk of any given change. For instance, if the configuration of an application clearly denotes that a production Web server cluster is set to deploy as a “rolling update” with a minimum of six instances, then the deployment team knows what to expect as the rollout proceeds.
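The rolling-update example can be sketched in a few lines of Python. This is an illustration rather than any particular tool's format: the `ClusterConfig` fields, the cluster name, and the total of eight instances are assumptions, and the sketch assumes instances are updated in place, so each batch is out of service while it is being replaced.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterConfig:
    name: str
    strategy: str         # e.g. "rolling_update"
    total_instances: int  # hypothetical cluster size
    min_in_service: int   # instances that must keep serving traffic


def rolling_batches(config: ClusterConfig) -> List[List[int]]:
    """Split instance indexes into batches that respect the minimum."""
    if config.strategy != "rolling_update":
        raise ValueError(f"unsupported strategy: {config.strategy}")
    batch_size = config.total_instances - config.min_in_service
    if batch_size < 1:
        raise ValueError("no spare capacity to update without dropping below the minimum")
    instances = list(range(config.total_instances))
    return [instances[i:i + batch_size]
            for i in range(0, len(instances), batch_size)]


if __name__ == "__main__":
    web = ClusterConfig("web-production", "rolling_update",
                        total_instances=8, min_in_service=6)
    for batch in rolling_batches(web):
        print("update instances", batch)
```

Because the minimum is declared alongside the strategy, anyone reading the configuration can work out the shape of the rollout (here, four batches of two) before it starts.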
Continuously integrating, testing, and releasing your application to customers is exhilarating and liberating, as long as you manage the risks of something going wrong. If you can identify and eliminate or minimize these risks, the benefits to your developers and ops people are massive.
So go forth and deploy things! Or, more accurately, build robots to deploy things.