Releasing often is a good thing. It’s cool, and helps us deliver new functionality quickly, but I want to share one positive side-effect – it helps with analyzing production performance issues.
We do releases every 5 to 10 days and after a recent release, the application CPU chart jumped twice (the lines are differently colored because we use blue-green deployment):
What are the typical ways to find performance issues with production loads?
- Connect a profiler directly to production – tricky, as it requires managing network permissions and might introduce unwanted overhead
- Run performance tests against a staging or local environment and do profiling there – good, except your performance tests might not hit exactly the functionality that causes the problem (this is what happens in our case, as it was some particular types of API calls that caused it, which weren’t present in our performance tests). Also, performance tests can be tricky
- Do a thread dump (and heap dump) and analyze them locally – a good step, but requires some luck and a lot of experience analyzing dumps, even if equipped with the right tools
- Check your git history / release notes for what change might have caused it – this is what helped us resolve the issue. And it was possible because there were only 10 days of commits between the releases.
We could go through all of the commits and spot potential performance issues. Most of them turned out not to be a problem, and one seemingly unproblematic pieces was discovered to be the problem after commenting it out for a brief period a deploying a quick release without it, to test the hypothesis. I’ll share a separate post about the particular issue, but we would have to waste a lot more time if that release has 3 months worth of commits rather than 10 days.
Sometimes it’s not an obvious spike in the CPU or memory, but a more gradual issue that you introduce at some point and it starts being a problem a few months later. That’s what happened a few months ago, when we noticed a stead growth in the CPU with the growth of ingested data. Logical in theory, but the CPU usage grew faster than the data ingestion rate, which isn’t good.
So we were able to answer the question “when did it start growing” in order to be able to pinpoint the release that introduced the issue. because the said release had only 5 days of commits, it was much easier to find the culprit.
All of the above techniques are useful and should be employed at the right time. But releasing often gives you a hand with analyzing where a performance issues is coming from.
The post Releasing Often Helps With Analyzing Performance Issues appeared first on Bozho's tech blog.
Recent Comments