AWS Alarms for Application Errors

Monitoring is key for any real-world application. You have to know what's happening and be alerted in real time if something goes wrong. AWS has CloudWatch for that, and it gives you a lot of metrics automatically. But there are some that you have to define yourself, and then you need to define proper alarms. Here I'll focus on four:

- High number of application errors
- High number of application warnings
- High number of 5xx errors on the load balancer
- High number of 4xx errors on the load balancer

First, the prerequisites:

- You need to be using CloudFormation to automate everything. You can create all of those things manually, but automation is a big plus.
- If using CloudFormation, you'd preferably have a sub-stack for configuring alarms.
- You need to be collecting your logs with CloudWatch Logs.

If you are not using CloudWatch Logs, here's a simple config file and script to enable them:

{
  "agent": {
    "metrics_collection_interval": 10,
    "region": "eu-west-1",
    "logfile": "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "{{logPath}}",
            "log_group_name": "{{logGroupName}}",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%Y-%m-%d %H:%M:%S"
          }
        ]
      }
    }
  }
}

# install the AWS CloudWatch agent
mkdir cloud-watch-agent
cd cloud-watch-agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
./install.sh

# fetch the config template and fill in the log path and log group name
aws s3 cp s3://$BUCKET_NAME/cloudwatch-agent-config.json /var/config/cloudwatch-agent-config.json
sed -i -- 's|{{logPath}}|/var/log/application.log|g' /var/config/cloudwatch-agent-config.json
sed -i -- 's|{{logGroupName}}|app_node|g' /var/config/cloudwatch-agent-config.json

# start the agent with the prepared configuration
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/var/config/cloudwatch-agent-config.json -s

Now you have to define two things: log metrics and alarms. The CloudFormation code below creates both:

"HighAppErrorsNotification": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmActions": [
      {
        "Ref": "NotificationTopicId"
      }
    ],
    "InsufficientDataActions": [
      {
        "Ref": "NotificationTopicId"
      }
    ],
    "AlarmDescription": "Notify if there are too many application...
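The same pair of building blocks – a metric filter that turns ERROR log lines into a metric, and an alarm on that metric – can also be sketched with the AWS CLI for quick experimentation before putting the equivalent resources into the CloudFormation sub-stack. This is a hedged illustration rather than the template from this post: the log group name app_node comes from the agent script above, while the metric name, namespace, threshold and $NOTIFICATION_TOPIC_ARN are placeholder assumptions.

# Illustrative metric filter: counts log lines containing "ERROR" in the
# app_node log group as a custom metric (names are placeholders).
aws logs put-metric-filter \
  --log-group-name app_node \
  --filter-name app-errors \
  --filter-pattern "ERROR" \
  --metric-transformations metricName=ApplicationErrors,metricNamespace=Application,metricValue=1

# Illustrative alarm on that metric: more than 5 errors in 15 minutes
# notifies an existing SNS topic (threshold and topic ARN are assumptions).
aws cloudwatch put-metric-alarm \
  --alarm-name HighAppErrorsNotification \
  --namespace Application \
  --metric-name ApplicationErrors \
  --statistic Sum \
  --period 900 \
  --evaluation-periods 1 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions "$NOTIFICATION_TOPIC_ARN"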

An AWS Elasticsearch Post-Mortem

So it happened that we had a production issue on the SaaS version of LogSentinel – our Elasticsearch stopped indexing new data. There was no data loss, as Elasticsearch is just a secondary storage, but it caused some issues for our customers (they could not see the real-time data on their dashboards). Below is a post-mortem analysis – what happened, why it happened, how we handled it and how we can prevent it.

Let me start with some background on how the system operates – we accept audit trail entries (logs) through a RESTful API (or syslog) and push them to a Kafka topic. The Kafka topic is then consumed to store the data in the primary storage (Cassandra) and to index it for better visualization and analysis in Elasticsearch. The managed AWS Elasticsearch service was chosen because it saves you all the overhead of cluster management, and as a startup we want to minimize our infrastructure management efforts. That's a blessing and a curse, as we'll see below.

We have alerting enabled on many elements, including the Elasticsearch storage space and the number of application errors in the log files. This allows us to respond quickly to issues. So the "high number of application errors" alarm triggered. Indexing was blocked due to FORBIDDEN/8/index write. We have a system call that re-enables it, so I tried to run it, but after less than a minute it was blocked again. This meant that our Kafka consumers failed to process the messages, which is fine, as we have a sufficient message retention period in Kafka, so no data could be lost. I...
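For reference, here is a minimal sketch of how an index-level write block can be cleared through the standard Elasticsearch settings API. It is an assumption-laden illustration, not the internal call mentioned above: $ES_ENDPOINT and logs-index are placeholders, and on the managed AWS service the block may simply be re-applied until the underlying cause is resolved, which is exactly the behaviour described here.

# Placeholder endpoint and index name; resets the dynamic index-level
# write/read-only blocks to their defaults. Only a temporary measure if the
# service keeps re-applying the block.
curl -s -X PUT "$ES_ENDPOINT/logs-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.write": null, "index.blocks.read_only_allow_delete": null}'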

Running a Safe Database Cluster in AWS With Auto-Scaling Groups

When you have to run a scalable application on AWS, your database must also be scalable. It's easier to scale the stateless application layer, where each node is mostly disposable – even if a node in a 3-node cluster fails, you can just fire up another one and nobody notices. The database layer is stateful, and therefore there's a risk of losing data. Having just a single node is not an option, as a node can always go down and that would mean downtime. So you need multiple nodes in a cluster to make sure your application is highly available and fault tolerant (I won't go into the differences in terminology).

What database am I talking about? It doesn't matter. It can be a SQL or a NoSQL database – each has some form of clustering available, whether it's active-active or active-passive.

Now, for AWS in particular, you can choose RDS (or another managed option), which will handle it for you. But if there's no managed option (e.g. Cassandra), or you don't feel the managed option gives you enough control, or it's more expensive, or the version you require is not available, you have to manage the database layer yourself. I won't go into the details of how to configure the database-specific clustering – you should check the documentation of the particular database for that. I'll try to give some tips on how to safely run the infrastructure that supports the database cluster.

And here come auto-scaling groups. They allow you to have a group of identical nodes (based on a launch configuration) and the ASG makes sure you...
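As a rough sketch of the idea (placeholder names, not the exact setup from this post), a fixed-size auto-scaling group for a 3-node database cluster could be created like this with the AWS CLI:

# min = max = desired = 3: the group never scales, it only replaces failed nodes.
# The launch configuration name and subnet IDs are placeholders; spread the
# subnets across availability zones so a single AZ failure doesn't take the
# whole cluster down.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name db-cluster \
  --launch-configuration-name db-node-launch-config \
  --min-size 3 \
  --max-size 3 \
  --desired-capacity 3 \
  --health-check-type EC2 \
  --health-check-grace-period 300 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333"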