It’s 7:00pm on a Friday evening and it’s release night, a night 6 months in the making. You have been tasked with releasing version 3.0.0 of your company’s latest software suite. The development teams have been working tirelessly on essential new features that your top customers are clamoring for.
You are sitting with your laptop open in the main conference room, waiting to execute the release. The VP of Technology is sitting to your left, along with the Head of Development, the Senior Developer and VP of Business Relations.
You can feel the tension in the air. The business team is expecting end users to resume reporting once the release is over in a few hours. The stakes are extremely high.
This new version was tested thoroughly in QA, but the infrastructure isn’t exactly the same. The database in QA has some test data, but is not a complete replica of production. QA is running a newer version of Ubuntu.
You ssh into the production servers to pull the latest code from master and restart the services. The development team is notorious for logging into production and manually making changes, so you hope there are no merge conflicts. Next, you follow the deployment guide you were given and manually create some tables for the application to function properly. But first you must update to a new version of a package to enable specific features. A majority of these changes can only really be tested in production.
You take a sip of your coffee, finish up your pizza, and then perform your changes and hope that everything works as expected. The only way to tell that it worked is to release. Rolling back would be extremely tedious and difficult, and would require an all-night hackathon.
Does this situation seem too familiar to you? Was your confidence level for your most recent release high or non-existent? What if a feature doesn’t work? Or worst of all, what happens when production is down with no immediate resolution? What will you tell your end-users in the morning who do not receive their reports? What if your largest customer decides that based on the issues with your last release, enough is enough?
If you have ever been in this situation, how do you eliminate the anxiety around release night and move towards an anxiety-free release process?
Here are six tips to reduce unnecessary cortisol, and improve the efficiency and efficacy of your next release.
- The first problem to tackle is the lack of similarity between QA and Production environments. Infrastructure as code allows you to define your infrastructure in code and build every environment from the same definition. You will have confidence that your hundreds of hours of testing aren’t wasted due to infrastructure drift. Terraform (https://www.terraform.io/) is a very popular tool for keeping infrastructure consistent across environments. Using the HashiCorp Configuration Language (HCL), you define the components needed to run a single application or your entire datacenter. You can use modules to organize related parts of your configuration into distinct logical components, which can be custom written in-house or imported from the Terraform Registry (https://registry.terraform.io/). Take a look at this excellent blog post on how to write more Informed Terraform code (https://foghornconsult.wpengine.com/2020/07/20/informed-terraform/).
- The second major problem to eliminate is manual database changes. You should not be logging into a production database from your laptop (most certainly not over the public internet!) to create, read, update, or delete data. There are much better ways to do this! There are many solutions to this problem, but the recommended way is to use an automation tool such as Flyway (https://flywaydb.org/). Flyway is essentially version control for your SQL-based database. With Flyway, changes to the database are called migrations and are stored in specially named .sql files. Development teams can collaborate on a single set of database changes that are applied sequentially, and the whole process can be integrated into an existing CI/CD workflow. Getting started is as simple as installing the command line tool and creating a database configuration file. Here is an example that migrates an existing database into Flyway, creates a new table, then inserts a few rows of data. First, you generate a SQL script that includes the entire DDL (including indexes, triggers, procedures…) of your production database and save it as ‘sql/V1__baseline_migration.sql’. Next, you run Flyway against your new production database to prepare it for migration, then apply your “existing” database changes to the “new” database. To create additional tables and insert data, you simply create new files ‘sql/V2__Create_tables.sql’ and ‘sql/V3__Add_data.sql’ that contain the SQL create and insert statements. Finally, you run ‘flyway migrate’ and you will see that each migration step is applied sequentially: your database is brought to the state of production, and then the additional V2 and V3 steps are applied. Any further changes would be stored in sequentially numbered files such as ‘sql/V4__Additional_data.sql’ (note the double underscore between the version and the description).
- Thirdly, you should not be manually pulling code from version control into a production environment (or any environment) and manually updating services and packages. A recommended alternative is to build an immutable image from source code, with all packages and updates already applied. Packer is a very popular and robust tool for this if you are deploying the whole OS with the artifact on it. The overall goal is to deploy a versioned image into production that is already validated and thoroughly tested.
- Another suggestion is to use blue/green or canary deployments. Blue/green deployment is the concept of running two identical production environments in parallel. The blue environment is your current production environment. The green environment contains your next release. The idea is to release a new version of your code to the green environment, then migrate traffic to this new environment. This would be prohibitively expensive with traditional static infrastructure, but with cloud infrastructure and automated provisioning via Terraform, cost is no longer a blocker. One way to achieve this is to have a public-facing load balancer that is configured using internal DNS to route traffic to either the blue or green environment. To perform the release, you simply configure the load balancer to route traffic to the green environment instead of the blue environment. You can set a short TTL on your DNS records to ensure this cutover happens in a matter of minutes. If you detect an issue with your newly deployed green environment, you can quickly roll back to your previous blue environment. Once all traffic is migrated to the green environment and you are confident the release is healthy, you can decommission the blue environment, and the green becomes the new blue. For your next release, you repeat the process by deploying the new version of your application to a fresh green environment.
- Canary deployments allow you to release your software to a small subset of your user base, perform testing, then gradually roll it out to the remaining users. This method gives you an early warning system (hence the name canary, as in canary in a coal mine) in the event that you release a change that results in a major outage. This way only a small subset of users experiences the outage, as opposed to your entire user base.
- The final suggestion can be summarized by the concept known as RERO, or release early, release often. This is a software development methodology that stresses the importance of early and frequent releases. Rather than waiting several months to perform a large release with many major changes, your team releases smaller incremental changes at a more frequent pace. This allows you to validate each release quickly because there are fewer changes to validate. This ultimately leads to higher quality software, as you are able to gather feedback from users and testers quickly and implement changes at a more accelerated pace.
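To make the migration naming convention from the second tip more concrete, here is a minimal Python sketch of how versioned migration files could be validated and ordered. This is not Flyway's implementation, only a simplified model of its `V<version>__<description>.sql` convention; the function names `parse_version` and `pending_migrations` are hypothetical and exist only for this illustration.

```python
import re

# Simplified model of the versioned naming convention:
#   V<version>__<description>.sql  (two underscores after the version)
# This is an illustrative assumption, not Flyway's actual parser.
MIGRATION_RE = re.compile(r"^V(\d+(?:[._]\d+)*)__(\w+)\.sql$")


def parse_version(filename):
    """Return the version as a tuple of ints, or None if the name is invalid."""
    match = MIGRATION_RE.match(filename)
    if match is None:
        return None
    return tuple(int(part) for part in re.split(r"[._]", match.group(1)))


def pending_migrations(filenames, applied_version):
    """Return valid migrations newer than applied_version, in version order."""
    versioned = []
    for name in filenames:
        version = parse_version(name)
        if version is not None:  # skip files that break the convention
            versioned.append((version, name))
    return [name for version, name in sorted(versioned) if version > applied_version]


files = [
    "V3__Add_data.sql",
    "V1__baseline_migration.sql",
    "V4_Additional_data.sql",  # single underscore: not a valid name in this model
    "V2__Create_tables.sql",
]

# The database is already at version 1, so only V2 and V3 remain.
print(pending_migrations(files, applied_version=(1,)))
```

In this sketch a file with a single underscore is silently skipped, which shows why a small naming slip can mean a migration is never picked up. A real tool may report such files differently, so check its documentation rather than relying on this model.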
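The canary idea above can also be sketched in a few lines of Python. The example below assumes users are identified by a string ID and that the rollout is controlled by a single percentage; the names `bucket` and `route` are hypothetical. In practice this decision usually lives in a load balancer or service mesh rather than in application code, so treat this only as an illustration of the principle.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the canary release."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


# Hypothetical user IDs, invented for this example.
users = [f"user-{n}" for n in range(1000)]

# Gradually widen the rollout and count how many users see the new release.
for percent in (0, 5, 25, 100):
    on_canary = sum(route(u, percent) == "canary" for u in users)
    print(f"{percent:>3}% rollout -> {on_canary} of {len(users)} users on canary")
```

Because the bucket is derived from a hash of the user ID, each user lands on the same side for a given percentage, and a user who is already on the canary stays there as the percentage grows. Setting the percentage back to 0 acts as the rollback.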
Implementing all of these changes will not completely eliminate anxiety around deployments, but it could drastically reduce it. The onus to deliver shouldn’t fall on one team member’s shoulders; each member of the team should make a meaningful contribution and shoulder some responsibility for the release. The Release Engineer should simply be pressing a button. When everyone is working from the same playbook, employee morale will rise dramatically. Your agile organization will be able to release more quickly, more confidently, and most importantly, with minimal or no customer impact.