Event Logs

Most systems have some sort of event log – a record of what has happened in the system and who did it. Sometimes this log has a dual existence – once as an “audit log”, and once as an event log used to replay what has happened. These are actually two separate concepts. The audit log is the trace that every action leaves in the system so that the system can later be audited; it’s preferable that this log is somehow secured (I’ll discuss that another time). The event log is a crucial part of the event-sourcing model, where the database stores only modifications rather than the current state. The current state is obtained by applying all the stored modifications up to the present moment, which allows seeing the state of the data at any moment in the past.

There are a bunch of ways to get that functionality. For audit logs there is Hibernate Envers, which stores all modifications in a separate table. You can also have a custom solution using Spring aspects or JPA listeners (that store modifications in an audit log table whenever a change happens). You can store the changes in multiple ways – as key/value rows (one row per modified field), or as objects serialized to JSON. Event sourcing can be achieved by always inserting a new record instead of updating or deleting (and incrementing a version and/or setting a “current” flag). There are also event-sourcing-native databases – Datomic and Event Store. (Note: “event sourcing” isn’t equal to “insert-only”, but the approach is very similar.) They still seem pretty similar – both track what has happened on the...
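To make the JPA-listener approach concrete, here is a minimal sketch (the class name and the write logic are mine, purely illustrative – not from a specific library): a listener, attached to entities via @EntityListeners, that records every create, update and delete in an audit log table.

```java
// Illustrative audit log listener; attach it to an entity with
// @EntityListeners(AuditLogListener.class).
import javax.persistence.PostPersist;
import javax.persistence.PostRemove;
import javax.persistence.PostUpdate;

public class AuditLogListener {

    @PostPersist
    public void onCreate(Object entity) {
        write("CREATE", entity);
    }

    @PostUpdate
    public void onUpdate(Object entity) {
        write("UPDATE", entity);
    }

    @PostRemove
    public void onDelete(Object entity) {
        write("DELETE", entity);
    }

    private void write(String action, Object entity) {
        // Here you would serialize the entity (to JSON, or one key/value
        // row per modified field) and insert it into the audit log table,
        // together with the acting user and a timestamp.
    }
}
```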

Spring Boot, @EnableWebMvc And Common Use-Cases

It turns out that Spring Boot doesn’t mix well with the standard Spring MVC @EnableWebMvc. When you add that annotation, Spring Boot’s autoconfiguration is disabled. The bad part (which cost me a few hours) is that no guide states that explicitly. This guide says that Spring Boot adds the annotation automatically, but doesn’t say what happens if you follow your previous experience and just put it there yourself. In fact, people who run into issues stemming from this automatically disabled autoconfiguration try to address it in various ways – most often by keeping @EnableWebMvc but also extending Spring Boot’s WebMvcAutoConfiguration. Like here, here and somewhat here. I found these after I got the same idea and implemented it that way, and then realized it was redundant, after going through Spring Boot’s code and seeing that an inner class in the autoconfiguration class has a single-line javadoc stating “Configuration equivalent to {@code @EnableWebMvc}”. That answered my question of whether Spring Boot’s autoconfiguration misses some of the @EnableWebMvc “features”. And it’s good that they extended the class that provides @EnableWebMvc, rather than mirroring the functionality (which is obvious, I guess). What should you do when you want to customize your beans? As usual, extend WebMvcConfigurerAdapter (annotating the new class with @Component) and make your customizations there. So, bottom line of this particular problem: don’t use @EnableWebMvc in Spring Boot; just include spring-web as a Maven/Gradle dependency and it will be autoconfigured. The bigger picture here resulted in me adding a comment in the main configuration class detailing why @EnableWebMvc should not be put there. So...
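For reference, a minimal sketch of that customization route (the concrete view mapping is just an example I picked, not from the original post):

```java
// Customizing Spring MVC in a Spring Boot app without @EnableWebMvc:
// extend WebMvcConfigurerAdapter and register the class as a bean.
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.config.annotation.ViewControllerRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;

@Component
public class WebConfig extends WebMvcConfigurerAdapter {

    @Override
    public void addViewControllers(ViewControllerRegistry registry) {
        // Map "/" straight to a view, without writing a controller
        registry.addViewController("/").setViewName("index");
    }
}
```

Because the class is a @Component and not annotated with @EnableWebMvc, Spring Boot’s autoconfiguration stays active and simply picks up the customizations.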

Distributed Cache – Overview

What’s a distributed cache? A solution that is “deployed” in an application (typically a web application) and that makes sure data is loaded from memory, rather than from disk (which is much slower), in order to improve performance and response time. That looks easy if the cache is to be used on a single machine – you just load your most active data from the database into memory (e.g. a Guava Cache instance) and serve it from there. It becomes a bit more complicated when this has to work in a cluster – e.g. 5 application nodes serving requests to users in a round-robin fashion. You have to update the in-memory cache on all machines each time a piece of data is updated by a request to one of the machines. If you just load all the data in memory and don’t invalidate it, the cache won’t be “coherent” – it will have stale values, and requests to different application nodes will return different results, which you most certainly want to avoid. Alternatively, you can have a single big cache server with tons of memory, but it can die – and that may disrupt the smooth operation, so you’d want at least 2 machines in a cluster.

You can get a distributed cache in different ways. To list a few: Infinispan (which I’ve covered previously), Terracotta/Ehcache, Hazelcast, Memcached, Redis, Cassandra, ElastiCache (by Amazon). The first three are Java-specific (and JCache-compliant), but the rest can be used in any setup. Cassandra wasn’t initially meant to be a cache solution, but it can easily be used as such. All of...
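To illustrate the single-machine starting point, here is a minimal sketch with a Guava LoadingCache (the class name and the database stub are mine):

```java
// Single-node caching with Guava: the most active entries are kept in
// memory and loaded from the database on a miss.
import java.util.concurrent.TimeUnit;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class ProductCache {

    private final LoadingCache<Long, String> cache = CacheBuilder.newBuilder()
            .maximumSize(100_000)                    // keep only the hottest entries
            .expireAfterWrite(10, TimeUnit.MINUTES)  // bound staleness on this node
            .build(new CacheLoader<Long, String>() {
                @Override
                public String load(Long id) {
                    return loadFromDatabase(id);     // fallback on cache miss
                }
            });

    public String get(Long id) {
        return cache.getUnchecked(id);
    }

    private String loadFromDatabase(Long id) {
        // stand-in for the real database call
        return "product-" + id;
    }
}
```

In a cluster, exactly this kind of local cache is what goes stale – which is why the distributed solutions listed above add invalidation or replication across nodes.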

Distributing Election Volunteers In Polling Stations

There’s an upcoming election in my country, and I’m a member of the governing body of one of the new parties. As we have a strong focus on technology (and e-governance), our internal operations also benefit from some IT skills. The particular task at hand these days was to distribute a number of election-day volunteers (who help observe the fair election process) among polling stations. I think it’s an interesting technical task, so I’ll try to explain the process.

First – the data sources. We have an online form for gathering volunteer requests, and second, we have local coordinators who collect volunteer declarations and send them centrally. Collecting all the data is problematic (to this moment), because filling in the online form doesn’t make you eligible – you also have to mail a paper declaration to the central office (horrible bureaucracy). Then there are the volunteer preferences – in the form they’ve indicated whether they are willing to travel or prefer their closest polling station. And then there are the “priority” polling stations, which are considered more risky and where we therefore need volunteers. I decided to do the following (a parsing sketch follows the list):

- Create a database table “volunteers” that holds all the data about all prospective volunteers
- Import all the data – using the Apache CSV parser, parse the CSV files (exported from Google Sheets) with (1) the online form responses and (2) the data from the received paper declarations
- Match the entries from the two sources by full name (as the declarations cannot contain an email, which would otherwise be the primary key)
- Geocode the addresses of the people
- Import all polling stations and...
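Here is the promised sketch of the import step, assuming the Apache Commons CSV library and hypothetical column names (I don’t know the actual headers of the form export):

```java
// Parsing one of the exported CSV files with Apache Commons CSV.
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class VolunteerImport {

    public static void main(String[] args) throws Exception {
        try (Reader reader = Files.newBufferedReader(Paths.get("online-form.csv"))) {
            Iterable<CSVRecord> records = CSVFormat.DEFAULT
                    .withFirstRecordAsHeader()  // the spreadsheet export has a header row
                    .parse(reader);
            for (CSVRecord record : records) {
                String fullName = record.get("full_name"); // assumed column name
                String address = record.get("address");    // assumed column name
                // insert into the "volunteers" table; the full name is later
                // used to match against the paper declarations
            }
        }
    }
}
```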

“Infinity” is a Bad Default Timeout

Many libraries wrap some external communication – be it a REST-like API, a message queue, a database, a mail server or something else. And therefore you have to have some timeout – for connecting, reading, writing or idling. Sadly, many libraries have their default timeouts set to “0” or “-1”, which means “infinity”, and that is a useless and even harmful default. There isn’t a practical use case where you’d want to hang forever waiting for a resource, and there are tons of situations where this can happen, e.g. the other end gets stuck. In the past 3 months I’ve had 2 libraries with a default timeout of “infinity”, and that eventually led to production problems because we had forgotten to configure them properly. Sometimes you don’t even see the problem until a thread pool gets exhausted.

So, I have a request to API/library designers (as I’ve done before – against property maps and encodings other than UTF-8): never use “infinity” as a default timeout – it will cause lots of production issues. Also note that it’s sometimes an underlying HTTP client (or Socket) that doesn’t have a reasonable default – it’s still your job to fix that when wrapping it. What default should you provide? A reasonable one – 5 seconds, maybe? You may (rightly) say you don’t want to impose an arbitrary timeout on your users. In that case I have a better proposal: explicitly require a timeout for building your “client” (because these libraries are most often clients for some external system), e.g. Client.create(url, credentials, timeout), and fail if no timeout is provided. That makes...
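A minimal sketch of that “fail if no timeout” idea (all names are illustrative, not from any particular library):

```java
// A client factory that refuses to construct a client without an
// explicit, positive timeout – so "infinity" can never happen by default.
import java.time.Duration;
import java.util.Objects;

public final class Client {

    private final String url;
    private final String credentials;
    private final Duration timeout;

    private Client(String url, String credentials, Duration timeout) {
        this.url = url;
        this.credentials = credentials;
        this.timeout = timeout;
    }

    public static Client create(String url, String credentials, Duration timeout) {
        Objects.requireNonNull(timeout, "an explicit timeout is required");
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        return new Client(url, credentials, timeout);
    }
}
```

The nice property is that a missing timeout fails immediately at construction time, during development – instead of hanging in production.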

Protecting Sensitive Data

If you are building a service that stores sensitive data, your number one concern should be how to protect it. What IS sensitive data? There are some obvious examples, like medical data or bank account data. But would you consider a dating site’s database sensitive? Based on recent leaks from a big dating site, I’d say yes. Is a cloud turn-by-turn navigation database sensitive? Most likely, as users’ journeys are stored there. Facebook messages, emails, etc. – all of that can and should be considered sensitive, and therefore must be highly protected. If you’re not sure whether the data you store is sensitive, assume it is, just in case – otherwise a subsequent breach can easily bring your business down.

Now, protecting data is no trivial feat, and it certainly cannot be covered in a single blog post. I’ll start by outlining a few good practices (an encryption sketch follows the list):

- Don’t dump your production data anywhere else. If you want a “replica” for testing purposes, obfuscate the data – replace the real values with fake ones.
- Make sure access to your servers is properly restricted. This includes using a “bastion” host, proper access control settings for your administrators, and key-based SSH access.
- Encrypt your backups – if your system is “perfectly” secured but your backups lie around unencrypted, they will be the weak spot. The decryption key should be as protected as possible (I’ll discuss that below).
- Encrypt your storage – especially if using a cloud provider, assume you can’t trust it. AWS, for example, offers EBS encryption, which is quite good. There are other approaches as well, e.g. using LUKS with keys...
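As promised, here is a sketch of the backup-encryption practice using the standard javax.crypto API with AES-GCM (my choice of primitive for the example – key management, the genuinely hard part, is out of scope here):

```java
// Encrypting a backup blob with AES-GCM; the random IV is prepended to
// the ciphertext so it is available at decryption time.
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class BackupEncryption {

    public static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256); // 256-bit AES key
        return kg.generateKey();
    }

    public static byte[] encrypt(byte[] backup, SecretKey key) throws Exception {
        byte[] iv = new byte[12]; // 96-bit IV, the recommended size for GCM
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(backup);
        byte[] result = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, result, 0, iv.length);
        System.arraycopy(ciphertext, 0, result, iv.length, ciphertext.length);
        return result;
    }
}
```

And note that the code is the easy part – the key itself must be stored and protected at least as carefully as the data.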