4 Ways to Apply Updates in the Cloud
If you've been in IT long enough, you can think of a time when a patch was applied on Friday, only to bring down a production system on Monday. That's why stakeholders tend to raise eyebrows when a developer tells them a software update is needed. In most cases, they believe that it's better to maintain current software versions instead of risking a production disaster. Unfortunately, this if-it-ain't-broke-don't-fix-it mentality can leave you running deprecated software and result in steep technical debt.
Luckily, this precarious situation can be mitigated by using cloud technology, specifically a strategy called continuous integration (CI). Throughout this post, we will outline four different methods of applying software patches and updates that are designed to be as seamless as possible.
Method 1: Production vs QA vs Development
Development to QA to production is the most common update method. P-Q-D is a method of applying a software update to three versions of the same app, one after another. For example, let's say you were developing a new banking app and needed to update the front-end Angular framework.
A software developer would apply the update to the development environment. Additionally, he would perform smoke tests and write automated tests to catch bugs. Then, barring any issues, the developer would promote the code to QA.
The QA environment must mirror the production environment as closely as possible. Oftentimes companies have an additional environment called the staging environment. The similarity between the QA and production environments is important because if the patch fails in QA, it will very likely fail in the all-important production environment too. With that said, the developer promotes the update to QA and the quality-assurance experts get to work.
The QA tester runs through a gamut of tests and realizes that the update is breaking jQuery code in the front end. For example, maybe the autopay functionality stopped working in the banking app on Internet Explorer. In other words, if this bug had made it into the production environment, customers wouldn't be able to pay their bills because of us. Not good!
Because the quality-assurance specialist caught this, she will be able to kick it back down to the developer for additional rework, thus saving everyone a big headache.
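The promotion flow described above can be sketched in a few lines of Python. This is only an illustration, not any particular CI product's API: the `promote` function, the gate callables, and the update name are all hypothetical.

```python
from typing import Callable, Dict

# Each gate decides whether an update may leave its environment.
# The gates below are hypothetical stand-ins for real test suites.
Gate = Callable[[str], bool]

# Promotion order: development -> QA -> production.
PROMOTION_PATH = [("development", "qa"), ("qa", "production")]


def promote(update: str, gates: Dict[str, Gate]) -> str:
    """Return the furthest environment the update is allowed to reach."""
    current = "development"
    for env, next_env in PROMOTION_PATH:
        if not gates[env](update):
            # Tests failed here, so the update is kicked back and goes no further.
            return current
        current = next_env
    return current


# The developer's smoke tests pass, but QA finds the autopay regression.
gates = {
    "development": lambda update: True,
    "qa": lambda update: False,
}
stopped_at = promote("angular-upgrade", gates)
print(f"Update stopped in: {stopped_at}")
```

Because the QA gate fails, the update never reaches production, which is the whole point of the pipeline.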
Also, this P-Q-D CI methodology is platform agnostic, meaning it can be applied to a cloud or non-cloud infrastructure. In fact, each of the three major cloud providers has its own version, listed below:
AWS: CodePipeline
Microsoft Azure: Azure Pipelines
Google Cloud: Cloud Build
In addition to the proprietary pipelines mentioned above, a popular CI tool called Jenkins integrates with any of the major cloud players. An entire article could be written on this method of patch application. However, let's jump into a different, and possibly safer, method: rolling updates.
Method 2: Rolling Updates
A rolling update is when servers are updated incrementally instead of all at once. This is particularly useful when the business requires zero downtime for its application. For instance, it would be useful if a critical production bug were discovered during peak business hours. Loss of uptime equals loss of revenue, after all.
For example, say you are the proud owner of a clothing website. This website receives around 10,000 users an hour and it’s load balanced between five different servers. Load balancing in this case means traffic is routed equally between the five servers to reduce latency and load times. Everything is going great until someone notices a serious bug: the BUY button is grayed out. Somehow during the last deployment this bug slipped in unnoticed.
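To make "routed equally" concrete, here is a minimal sketch that assumes the load balancer uses simple round-robin, which is only one of several possible strategies. The server names are made up for illustration.

```python
from collections import Counter
from itertools import cycle

# Five hypothetical servers behind the load balancer.
servers = [f"server-{i}" for i in range(1, 6)]
rotation = cycle(servers)


def route(request_id: int) -> str:
    """Round-robin: hand each incoming request to the next server in turn."""
    return next(rotation)


# Simulate one hour of traffic (10,000 requests) and count requests per server.
hits = Counter(route(request_id) for request_id in range(10_000))
for name, count in sorted(hits.items()):
    print(name, count)
```

Each server ends up with the same share of the hour's traffic, so no single machine becomes a bottleneck.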
One solution would be to update all servers at once. Unfortunately, that would mean all servers would restart at the same time. That's 10,000 potential customers who are not able to access the site, and that simply won't do. A better solution would be rolling updates. When the fix is deployed to production, it goes to only one server at first. The load balancer takes that server out of rotation and directs the traffic that would have gone to it to the other servers. Once that server has been updated and restarted, traffic is routed to it again. This process is repeated until every server is updated. In this case, there is zero downtime, and the button has now been fixed.
One of the downsides to this method is that while the rolling update is occurring, some of the servers are fixed and the others are not, so some users still see the bug. Is it possible to avoid this mixed state? Let's see if blue-green deployment can do the trick.
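A rolling update over the five servers might look roughly like the sketch below. The `Server` class, the version strings, and the `rolling_update` function are assumptions for illustration; a real load balancer would also run health checks before putting a server back in rotation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    name: str
    version: str
    in_rotation: bool = True  # True while the load balancer sends it traffic


def rolling_update(servers: List[Server], new_version: str) -> int:
    """Update one server at a time.

    Returns the lowest number of servers that were serving traffic at any point.
    """
    min_serving = sum(s.in_rotation for s in servers)
    for server in servers:
        server.in_rotation = False      # load balancer drains this server
        min_serving = min(min_serving, sum(s.in_rotation for s in servers))
        server.version = new_version    # deploy the fix and restart
        server.in_rotation = True       # server rejoins the pool
    return min_serving


fleet = [Server(f"server-{i}", "1.0.0") for i in range(1, 6)]
lowest = rolling_update(fleet, "1.0.1")
print(f"Servers serving traffic never dropped below {lowest}")
```

Only one server is out of rotation at any moment, so the remaining servers keep the site available throughout the update.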
Method 3: Blue-Green Deployment
Blue-green deployments work by having two servers that are replicas of each other. One is the active production server and the other is the stage server. The stage server is a replica in every way of the production server. Whenever an update needs to be made, it is applied to the offline stage server. Once the update is complete, we simply switch production traffic over to the stage server, which becomes the new production server. Voilà! We just performed a software update with zero downtime.
At a high level, this is pretty simple to understand. However, there are some nuances worth discussing, specifically the three different types of blue-green deployments.
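As a rough sketch of the switch, assume a router that simply tracks which environment is live and which is idle. The `Environment` and `Router` classes and the version numbers are hypothetical; in practice the switch is usually a DNS or load-balancer change.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    version: str


class Router:
    """Tracks which environment receives production traffic."""

    def __init__(self, live: Environment, idle: Environment) -> None:
        self.live = live
        self.idle = idle

    def deploy(self, version: str) -> None:
        # Updates only ever touch the idle (offline) environment.
        self.idle.version = version

    def switch(self) -> None:
        # Swap roles: the freshly updated environment goes live.
        self.live, self.idle = self.idle, self.live


router = Router(live=Environment("blue", "1.0.0"), idle=Environment("green", "1.0.0"))
router.deploy("1.1.0")
router.switch()
print(f"Live: {router.live.name} running {router.live.version}")
```

One design benefit of this shape is that the previous version is still sitting on the idle environment, so calling `switch()` again acts as a rollback.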
3 Different Types of Blue-Green Deployment
The key difference in blue-green methodology comes down to the amount of traffic you want going to the newly updated server. The three different methods are Canary, Linear, and All-At-Once.
What is a Canary Deployment?
A canary deployment is when traffic is shifted over to a new server in set increments. One reason to do this is if you are putting out a new feature and are unsure whether users will like it or not. Wouldn't it be nice if you could just test it on a few of them? That's where canary comes into play.
Canary deployment would say, "Just send over 5% of the traffic after the update is complete." If you find that the bounce rate vastly increases, maybe you will want to hold off on sending the rest of the traffic over. However, if the bounce rate is normal or, preferably, lower, the next increment may be 10%, then 30%, and so on until all traffic is routed to the new server.
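The decision loop behind a canary rollout can be sketched as follows. The 5%, 10%, and 30% steps come from the example above; the remaining steps, the `tolerance` threshold, and the `bounce_rate_at` callback are assumptions made for illustration.

```python
from typing import Callable

# Traffic percentages for the new server; steps after 30 are assumed.
CANARY_STEPS = [5, 10, 30, 60, 100]


def run_canary(
    bounce_rate_at: Callable[[int], float],
    baseline: float,
    tolerance: float = 0.05,
) -> int:
    """Shift traffic step by step and return the percentage we end up holding at.

    `bounce_rate_at(percent)` is a hypothetical metric lookup that returns the
    bounce rate observed while `percent` of traffic goes to the new server.
    """
    for percent in CANARY_STEPS:
        if bounce_rate_at(percent) > baseline + tolerance:
            # Bounce rate jumped: hold here instead of sending more traffic.
            return percent
    return 100


# Healthy release: bounce rate stays at the baseline the whole way.
healthy = run_canary(lambda percent: 0.40, baseline=0.40)
# Unpopular release: bounce rate spikes as soon as users see it.
unpopular = run_canary(lambda percent: 0.70, baseline=0.40)
print(healthy, unpopular)
```

The healthy release proceeds to full traffic, while the unpopular one is held at the first increment so only a small share of users is affected.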
What is a Linear Deployment?
Linear deployment is similar to the canary deployment. Remember that the principle of each of these deployments is the same: they are all blue-green. Bearing that in mind, the only thing that can really change is how we would like to route traffic to the new production server. In a linear deployment, every increment is exactly the same size. So first you would route 10%, then another 10%, and so on. For example, every 10 minutes an extra 10% of the traffic would route to the new server. With the first 10% shifted immediately, nine more increments are needed, so the shift would complete in a predictable 90 minutes.
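Because every increment is identical, the whole schedule follows from just the step size and the interval, so it is worth computing rather than estimating. The sketch below assumes the first increment is applied at minute 0; the function name is hypothetical.

```python
from typing import List, Tuple


def linear_schedule(step_percent: int, interval_minutes: int) -> List[Tuple[int, int]]:
    """Return (minutes_elapsed, percent_on_new_server) pairs.

    Assumes the first increment is applied immediately, at minute 0.
    """
    schedule = []
    percent = 0
    minute = 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


schedule = linear_schedule(step_percent=10, interval_minutes=10)
for minute, percent in schedule:
    print(f"t+{minute:>2} min: {percent}% of traffic on the new server")
```

The last entry of the schedule gives the time at which all traffic has moved to the new server.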
What is an All-At-Once Deployment?
An All-At-Once deployment is about as basic as it gets. Once the update has been applied to the stage server, the team will route traffic to it all at once. The biggest advantage of this is that everybody receives the update at the same time. Of course, like most things in life, its greatest asset can be its greatest weakness. If something goes wrong, everyone will see it.
Because All-At-Once deployments are so much more basic than the other two, they are generally far easier to set up. So for a new project, it may be a good idea to start as an All-At-Once deployment just to get going, and then switch to another one of the two later if need be.
Method 4: Failover Cluster
Failover clusters are a little different from the other three, but are important to consider when building your continuous integration architecture. A failover cluster consists of two or more nodes that are always in communication with each other. One node is the current production server, and the other one is on standby. Notice that this is similar to the blue-green deployment. However, there are some differences.
For one, the two nodes are in constant communication with each other. It is a basic connection called a heartbeat. All the two nodes are doing is making sure that the other one is still online. Should this heartbeat fail to make a connection, the node that is on standby is officially activated.
Failover clusters are very common with Microsoft SQL Server and other Microsoft proprietary software. One of the advantages of updating failover clusters is that if something goes wrong, it is very easy to "failover" to the passive server with little to no downtime. Another great advantage of failover clusters is that they reduce single points of failure.
In the grand scheme of things, the most common deployment strategies you will see are blue-green and development vs. QA vs. production. Blue-green deployments are perfect for cloud environments because the infrastructure can be described as code. That makes it very easy to create replicated instances of production. Additionally, rolling updates are very manageable in the cloud. This is due to event-driven services such as AWS Lambda or Azure Functions.
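The heartbeat check can be illustrated with a small sketch. The `Node` class, the role names, and the five-second timeout are assumptions; real cluster software has its own quorum and timing rules. Timestamps are passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str              # "active", "standby", or "failed"
    last_heartbeat: float  # timestamp (seconds) of the last heartbeat received


def check_failover(active: Node, standby: Node, now: float, timeout: float = 5.0) -> Node:
    """Return the node that should be serving traffic at time `now`."""
    if now - active.last_heartbeat > timeout:
        # Heartbeat missed: the standby node is officially activated.
        active.role, standby.role = "failed", "active"
        return standby
    return active


primary = Node("node-a", "active", last_heartbeat=100.0)
secondary = Node("node-b", "standby", last_heartbeat=100.0)

serving_early = check_failover(primary, secondary, now=103.0)  # within the timeout
serving_late = check_failover(primary, secondary, now=110.0)   # heartbeat overdue
print(serving_early.name, serving_late.name)
```

While heartbeats arrive on time the active node keeps serving; once one is overdue, the standby node takes over.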
Furthermore, these methodologies can be used in conjunction with each other. For example, a business could have both their blue-green-deployed servers on a failover cluster. Or their rolling updates could have a dev, QA, and production pipeline. Both of these scenarios are common. Whichever method your company chooses, it will work out fine as long as everything is configured properly.