When you have to run a scalable application on AWS, your database must also be scalable. It’s easier to scale the stateless application layer, where each node is mostly disposable – even if a node in a 3-node cluster fails, you can just fire up another one and nobody notices.
The database layer is stateful, so there’s a risk of losing data. Having just a single node is not an option, as a node can always go down and that would mean downtime. So you need multiple nodes in a cluster to make sure your application is highly available and fault tolerant (I won’t go into the differences in terminology).
What database am I talking about? It doesn’t matter. It can be a SQL or a NoSQL database – each has some form of clustering available, whether it’s active-active or active-passive.
Now, for AWS in particular, you can choose RDS (or another managed option), which will handle it for you. But if there’s no managed option (e.g. Cassandra), or you don’t feel the managed option is giving you enough control, or it is more expensive, or the version you require is not available, you have to manage the database layer yourself. I won’t go into the details of how to configure the database-specific clustering – you should check the documentation of the particular database for that. I’ll try to give some tips on how to safely run the infrastructure that supports the database cluster.
And here come auto-scaling groups (ASGs). They allow you to have a group of identical nodes (based on a launch configuration), and the ASG makes sure you always have at least X healthy nodes by starting new nodes if existing ones fail. They can also automatically kill unhealthy nodes (that is, nodes that do not respond to automated healthchecks).
That’s awesome for application nodes, but it can be an issue with database nodes. You don’t necessarily want your database node killed if it gets unresponsive for a while. That’s why below I’ve compiled a list of tips to avoid pitfalls. Unfortunately, many of them are not available through CloudFormation, so you have to do them manually. And document them so that you don’t forget in case you need to recreate the stack:
- Set the minimum number of nodes to 1. This protects against accidentally setting the “Desired” count to 0 while experimenting with other, unrelated ASGs.
- Make sure you have enabled termination protection for each instance as well as scale-in termination protection per ASG.
- In the ASG settings there’s “Suspended Processes”. Make sure you suspend “Terminate” and “ReplaceUnhealthy”.
- Make sure that in your LaunchConfiguration the EBS volume is not deleted in case of termination. Why do you need that, given that you have disabled all termination options? Well, termination can occasionally happen due to issues with the underlying host, or a node may be scheduled for decommissioning.
- If at some point you need to restore from an EBS volume: 1. let the ASG spawn a new node; 2. temporarily add “Launch” to the suspended processes; 3. detach the root volume of the node; 4. attach the old EBS volume as /dev/xvda; 5. start the node.
- Set up a lifecycle policy (through CloudFormation or manually) to back up the database EBS volumes. Make sure you set the proper tags on the volumes (and this can only be done manually).
- Make sure the ASG can spawn instances in multiple availability zones (in case one goes down).
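If you’d rather script the protection-related tips than click through the console, here is a minimal sketch in Python. It assumes the two client arguments are boto3 clients (`boto3.client("autoscaling")` and `boto3.client("ec2")`); the function name `harden_database_asg` and the group name in the usage note are hypothetical, and the API calls are the ones I’d expect to map to the settings above, so verify them against your own setup before relying on this.

```python
def harden_database_asg(autoscaling, ec2, asg_name):
    """Apply the termination-related protections from the list to one ASG.

    `autoscaling` and `ec2` are assumed to be boto3 clients; `asg_name` is
    the name of an existing auto-scaling group.
    """
    # Minimum of 1 node, and scale-in protection for instances launched later.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=1,
        NewInstancesProtectedFromScaleIn=True,
    )

    # Stop the ASG from terminating or replacing nodes on its own.
    autoscaling.suspend_processes(
        AutoScalingGroupName=asg_name,
        ScalingProcesses=["Terminate", "ReplaceUnhealthy"],
    )

    # Collect the instances that are currently in the group.
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    instance_ids = [
        instance["InstanceId"]
        for group in groups["AutoScalingGroups"]
        for instance in group["Instances"]
    ]

    # Scale-in protection for the instances that already exist.
    if instance_ids:
        autoscaling.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=instance_ids,
            ProtectedFromScaleIn=True,
        )

    # Termination protection on each instance.
    for instance_id in instance_ids:
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            DisableApiTermination={"Value": True},
        )

    return instance_ids
```

You would call it as `harden_database_asg(boto3.client("autoscaling"), boto3.client("ec2"), "db-asg")`. Passing the clients in, rather than creating them inside the function, makes it easy to try the logic against a stub before pointing it at a real account.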
If you follow that, your auto-scaling groups will not behave exactly like auto-scaling groups. You can still configure automatically increasing the number of nodes in case of increased load, but the rest of the features are rarely a good idea for database layers – you’d prefer to resolve your database issues on existing machines, even if stopped temporarily, rather than just spawn new ones.
But you should embrace failure. Even with all termination protections, you have to assume everything may fail and die and you should have a clear path to restoring your nodes.
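One way to keep that restore path clear is to script the volume-swap steps from the list of tips. The following is a rough sketch, again assuming boto3 clients are passed in; the function name `restore_node_from_volume` is hypothetical. It also makes two assumptions that are not spelled out in the steps: the node is stopped before its root volume is detached, and the old volume lives in the same availability zone as the new node.

```python
def restore_node_from_volume(autoscaling, ec2, asg_name, instance_id,
                             old_volume_id, device="/dev/xvda"):
    """Swap the root volume of a freshly spawned node for a preserved one.

    `autoscaling` and `ec2` are assumed to be boto3 clients. Returns the ID
    of the detached root volume so it can be inspected or deleted later.
    """
    # Temporarily suspend "Launch" so the ASG does not add nodes meanwhile.
    autoscaling.suspend_processes(
        AutoScalingGroupName=asg_name, ScalingProcesses=["Launch"]
    )
    try:
        # Assumption: the node must be stopped before detaching its root volume.
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

        # Find the volume currently attached as the root device.
        reservations = ec2.describe_instances(InstanceIds=[instance_id])
        instance = reservations["Reservations"][0]["Instances"][0]
        root_device = instance["RootDeviceName"]
        root_volume_id = next(
            mapping["Ebs"]["VolumeId"]
            for mapping in instance["BlockDeviceMappings"]
            if mapping["DeviceName"] == root_device
        )

        # Detach the root volume of the new node.
        ec2.detach_volume(VolumeId=root_volume_id, InstanceId=instance_id)
        ec2.get_waiter("volume_available").wait(VolumeIds=[root_volume_id])

        # Attach the old EBS volume in its place.
        ec2.attach_volume(
            VolumeId=old_volume_id, InstanceId=instance_id, Device=device
        )
        ec2.get_waiter("volume_in_use").wait(VolumeIds=[old_volume_id])

        # Start the node again.
        ec2.start_instances(InstanceIds=[instance_id])
    finally:
        # "Launch" was only suspended temporarily, so always resume it.
        ec2_done = True  # marker for readability; no effect on the flow
        autoscaling.resume_processes(
            AutoScalingGroupName=asg_name, ScalingProcesses=["Launch"]
        )
    return root_volume_id
```

The first step (letting the ASG spawn the new node) is left to the ASG itself; this function starts from the ID of that new node. Resuming “Launch” in a `finally` block is a design choice so that a failed restore does not leave the group unable to launch nodes.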
The post Running a Safe Database Cluster in AWS With Auto-Scaling Groups appeared first on Bozho's tech blog.