Elasticsearch is great, but optimizing it for high load is always tricky. This won’t be yet another “Tips and tricks for optimizing Elasticsearch” article – there are many great ones out there. I’m going to focus on one narrow use-case – multitenant systems, i.e. those that support multiple customers/users (tenants).
You can build a multitenant search engine in three different ways:
- Cluster per tenant – this is the hardest to manage and requires a lot of devops automation. Depending on the types of customers it may be worth it to completely isolate them, but that’s rarely the case
- Index per tenant – this can be fine initially, and requires little additional coding (you just parameterize the “index” parameter in the URL of the queries), but it’s likely to cause problems as the customer base grows. Also, supporting consistent mappings and settings across indexes may be trickier than it sounds (e.g. some may reject an update and others may not depending on what’s indexed). Moving data to colder indexes also becomes more complex.
- Tenant-based routing – this means you put everything in one cluster but you configure your search routing to be tenant-specific, which allows you to logically isolate data within a single index.
The last one seems to be the preferred option in general. What is routing? The Elasticsearch blog has a good overview and documentation. The idea lies in the way Elasticsearch handles indexing and searching – it splits data into shards (each shard is a separate Lucene index and can be replicated on more than one node). A shard is a logical grouping within a single Elasticsearch node. When no custom routing is used, and an index request comes, the ID is used to determine which shard is going to be used to store the data. However, during search, Elasticsearch doesn’t know which shards have the data, so it has ask multiple shards and gather the results. Related to that, there’s the newly introduced adaptive replica selection, where the proper shard replica is selected intelligently, rather than using round-robin.
Custom routing allows you to specify a routing value when indexing a document and then a search can be directed only to the shard that has the same routing value. For example, at LogSentinel when we index a log entry, we use the data source id (applicationId) for routing. So each application (data source) that generates logs has a separate identifier which allows us to query only that data source’s shard. That way, even though we may have a thousand clients with a hundred data sources each, a query will be precisely targeted to where the data for that particular customer’s data source lies.
This is key for horizontally scaling multitenant applications. When there’s terabytes of data and billions of documents, many shards will be needed (in order to avoid large and heavy shards that cause performance issues). Finding data in this haystack requires the ability to know where to look.
Note that you can (and probably should) make routing required in these cases – each indexed document must be required to have a routing key, otherwise an implementation oversight may lead to a slow index.
Using custom routing you are practically turning one large Elasticsearch cluster into smaller sections, logically separated based on meaningful identifiers. In our case, it is not a userId/customerId, but one level deeper – there are multiple shards per customer, but depending on the use-case, it can be one shard per customer, using the userId/customerId. Using more than one shard per customer may complicate things a little – for example having too many shards per customer may require searches that span too many shards, but that’s not necessarily worse than not using routing.
There are some caveats – the isolation of customer data has to be handled in the application layer (whereas for the first two approaches data is segregated operationally). If there’s an application bug or lack of proper access checks, one user can query data from other users’ shards by specifying their routing key. It’s the role of the application in front of Elasticsearch to only allow queries with routing keys belonging to the currently authenticated user.
There are cases when the first two approaches to multitenancy are viable (e.g. a few very large customers), but in general the routing approach is the most scalable one.