A blue-green deployment is a way to have incremental updates to your production stack without downtime and without any complexity for properly handling rolling updates (including the rollback functionality)
I don’t need to repeat this wonderful explanation or Martin Fowler’s original piece. But I’ll extend on them.
A blue-green deployment is one where there is an “active” and a “spare” set of servers. The active running the current version, and the spare being ready to run any newly deployed version. The “active” and “spare” is slightly different than “blue” and “green”, because one set is always “blue” and one is always “green”, while the “active” and “spare” labels change.
On AWS, for example, you can script the deployment by having two child stacks of your main stacks – active and spare (indicated by a stack label), each having one (or more) auto-scaling group for your application layer, and a script that does the following (applicable to non-AWS as well):
- push build to an accessible location (e.g. s3)
- set the spare auto-scaling group size to the desired value (the spare stays at 0 when not used)
- make it fetch the pushed build on startup
- wait for it to start
- run sanity tests
- switch DNS to point to an ELB in front of the spare ASG
- switch the labels to make the spare one active and vice versa
- set the previously active ASG size to 0
The application layer is stateless, so it’s easy to do hot-replaces like that.
But (as Fowler indicated) the database is the most tricky component. If you have 2 databases, where the spare one is a slave replica of the active one (and that changes every time you switch), the setup becomes more complicated. And you’ll still have to do schema changes. So using a single database, if possible, is the easier approach, regardless of whether you have a “regular” database or a schemaless one.
In fact, it boils down to having your application modify the database on startup, in a way that works with both versions. This includes schema changes – table (or the relevant term in the schemaless db) creation, field addition/removal and inserting new data (e.g. enumerations). And it can go wrong in many ways, depending on the data and datatypes. Some nulls, some datatype change that makes a few values unparseable, etc.
Of course, it’s harder to do it with a regular SQL database. As suggested in the post I linked earlier, you can use stored procedures (which I don’t like), or you can use a database migration tool. For a schemaless database you must do stuff manually, but but fewer actions are normally needed – you don’t have to alter tables or explicitly create new ones, as everything is handled automatically. And the most important thing is to not break the running version.
But how to make sure everything works?
- test on staging – preferably with a replica of the production database
- (automatically) run your behaviour/acceptance/sanity test suites against the not-yet-active new deployment before switching the DNS to point to it. Stop the process if they fail.
Only after these checks pass, switch the DNS and point your domain to the previously spare group, thus promoting it to “active”. Switching can be done manually, or automatically with the deployment script. The “switch” can be other than a DNS one (as you need a low TTL for that). It can be a load-balancer or a subnet configuration, for example – the best option depends on your setup. And while it is good to automate everything, having a few manual steps isn’t necessarily a bad thing.
Overall, I’d recommend the blue-green deployment approach in order to achieve zero downtime upgrades. But always make sure your database is properly upgraded, so that it works with both the old and the new version.