ElasticSearch Multitenancy With Routing

Elasticsearch is great, but optimizing it for high load is always tricky. This won’t be yet another “Tips and tricks for optimizing Elasticsearch” article – there are many great ones out there. I’m going to focus on one narrow use-case – multitenant systems, i.e. those that support multiple customers/users (tenants). You can build a multitenant search engine in three different ways: Cluster per tenant – this is the hardest to manage and requires a lot of devops automation. Depending on the types of customers it may be worth it to completely isolate them, but that’s rarely the case Index per tenant – this can be fine initially, and requires little additional coding (you just parameterize the “index” parameter in the URL of the queries), but it’s likely to cause problems as the customer base grows. Also, supporting consistent mappings and settings across indexes may be trickier than it sounds (e.g. some may reject an update and others may not depending on what’s indexed). Moving data to colder indexes also becomes more complex. Tenant-based routing – this means you put everything in one cluster but you configure your search routing to be tenant-specific, which allows you to logically isolate data within a single index. The last one seems to be the preferred option in general. What is routing? The Elasticsearch blog has a good overview and documentation. The idea lies in the way Elasticsearch handles indexing and searching – it splits data into shards (each shard is a separate Lucene index and can be replicated on more than one node). A shard is a logical grouping within a single Elasticsearch...

An AWS Elasticsearch Post-Mortem

So it happened that we had a production issue on the SaaS version of LogSentinel – our Elasticsearch stopped indexing new data. There was no data loss, as elasticsearch is just a secondary storage, but it caused some issues for our customers (they could not see the real-time data on their dashboards). Below is a post-mortem analysis – what happened, why it happened, how we handled it and how we can prevent it. Let me start with a background of how the system operates – we accept audit trail entries (logs) through a RESTful API (or syslog), and push them to a Kafka topic. Then the Kafka topic is consumed to store the data in the primary storage (Cassandra) and index it for better visualization and analysis in Elasticsearch. The managed AWS Elasticsearch service was chosen because it saves you all the overhead of cluster management, and as a startup we want to minimize our infrastructure management efforts. That’s a blessing and a curse, as we’ll see below. We have alerting enabled on many elements, including the Elasticsearch storage space and the number of application errors in the log files. This allows us to respond quickly to issues. So the “high number of application errors” alarm triggered. Indexing was blocked due to FORBIDDEN/8/index write. We have a system call that enables it, so I tried to run it, but after less than a minute it was blocked again. This meant that our Kafka consumers failed to process the messages, which is fine, as we have a sufficient message retention period in Kafka, so no data can be lost. I...

Near Real-Time Indexing With ElasticSearch

Choosing your indexing strategy is hard. The Elasticsearch documentation does have some general recommendations, and there are some tips from other companies, but it also depends on the particular usecase. In the typical scenario you have a database as the source of truth, and you have an index that makes things searchable. And you can have the following strategies: Index as data comes – you insert in the database and index at the same time. It makes sense if there isn’t too much data; otherwise indexing becomes very inefficient. Store in database, index with scheduled job – this is probably the most common approach and is also easy to implement. However, it can have issues if there’s a lot of data to index, as it has to be precisely fetched with (from, to) criteria from the database, and your index lags behind the actual data with the number of seconds (or minutes) between scheduled job runs Push to a message queue and write an indexing consumer – you can run something like RabbitMQ and have multiple consumers that poll data and index it. This is not straightforward to implement because you have to poll multiple items in order to leverage batch indexing, and then only mark them as consumed upon successful batch execution – somewhat transactional behaviour. Queue items in memory and flush them regularly – this may be good and efficient, but you may lose data if a node dies, so you have to have some sort of healthcheck based on the data in the database Hybrid – do a combination of the above; for example if you...