You may be familiar with the “noisy neighbour” problem with virtualization – someone else’s instances on the same physical machine “steal” CPU from your instance. I won’t be giving advice on how to solve the issue (the quick & dirty way is to terminate your instance and let it be spawned on another physical machine); instead I’ll explain my observations “at scale”.
I didn’t actually experience a typical “noisy neighbour” on AWS – i.e. one instance being significantly “clogged”. But as I noted in an earlier post about performance benchmarks, the overall AWS performance depends on many factors.
The time of day is the obvious one – as I’m working from UTC+2, my early morning is practically the time when Europe has not yet woken up, and the US has already gone to sleep. So the load on AWS is expected to be lower. When I experiment with CloudFormation stacks in the morning and in the afternoon, the difference is quite noticeable (though I haven’t measured it) – “morning” stacks are up and running much faster than “afternoon” ones. It takes less time for instances, ELBs, and the whole stack to be created.
But last week we observed something rather curious. Our regular load test was scheduled for Thursday, but from then until the end of the week the performance was horrible – we couldn’t even get a healthy run, because many, many requests were failing due to timeouts from internal ELBs. On top of that, spot instances (instances for which you bid a certain price and which someone else can “steal” from you at any time) were rather hard to keep – there was huge demand for them and our spot instances were constantly claimed by someone else. Yet the AWS region was reported to be in an “ok” state, with no errors.
What was happening last Thursday? The UK elections. I can’t prove it had such an effect on the whole of AWS, and I initially offered that as a joke explanation, but an EU AWS region during the UK elections is likely to be experiencing high load. Noticeably high load, it seems, so that the whole infrastructure was under pressure for everyone else. (It might have been a coincidence, of course.) And it wasn’t a typical “noisy neighbour” – it was the ELBs that were not performing. And then, this week, things were back to normal.
The AWS infrastructure is complex and has way more than just “instances”, so even if you have enough CPU to handle noisy neighbours, any other component can suffer from increased load on the whole infrastructure: ELBs, RDS, SQS, S3, even your VPC subnets. When AWS is under pressure, you’ll feel it, one way or another.
The moral? Embrace failure, of course. Have monitoring that notifies you when the infrastructure becomes less stable, and have a fault-tolerant setup with proper retries and fallbacks.
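To make “proper retries and fallbacks” a bit more concrete, here is a minimal sketch in Python. It is not what we run in production; the function name `call_with_retries`, its parameters and the default values are all made up for illustration. It assumes the failing call surfaces as a `TimeoutError` or `ConnectionError`, retries with exponential backoff and random jitter, and returns a fallback result once the attempts are exhausted.

```python
import random
import time


def call_with_retries(operation, fallback, max_attempts=4, base_delay=0.2,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(); retry on timeouts/connection errors, then fall back.

    All names and defaults here are illustrative assumptions, not a
    recommendation for any particular AWS service.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                break  # out of attempts, use the fallback below
            # Exponential backoff (0.2s, 0.4s, 0.8s, ...) capped at max_delay.
            # The actual wait is a random value up to that cap, so many
            # clients don't all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    return fallback()
```

You would wrap the call that goes through the internal ELB in `operation` and pass something like a cached or default response as `fallback`. The `sleep` parameter exists only so that tests can skip the waiting. The jitter matters precisely in the situation described above: when the whole infrastructure is under pressure, retrying in lockstep only adds to the load.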