A Disk-Backed ArrayList

It sometimes happens that your list can become too big to fit in memory and you have to do something in order to avoid running out of memory.

The proper way to do that is streaming – instead of fitting everything in memory, you should stream data from the source and discard the entries that are already processed.

However, there are cases when code that’s outside of your control requires a List and you can’t use streaming. These cases are rather rare but in case you hit them, you have to find a workaround. One is to re-implement the code to work with streaming, but depending on the way the library is written, it may not be possible. So the other option is to use a disk-backed list – one that works as a list, but underneath stores and loads elements from disk.

Searching for existing solutions results in several 3+ years old repos like this one and this one and this one.

And then there’s MapDB, which is great and supported. It’s mostly about maps, but it does support a List as well, as shown here.

And finally, you have the option to implement something simpler yourself, in case you need just iteration and almost nothing else. I’ve done it here – DiskBackedArrayList.java. It doesn’t support many things (not all methods are overridden to throw an exception, but they should). But most importantly, it doesn’t support random adding and random getting, and also toArray(). It’s purely “fill the collection” and then “iterate the collection”. It relies on ObjectOutputStream which is not terribly efficient, but is simple to use. Note that I’ve allowed a short in-memory prependList in case small amounts of data need to be prepended to the list.

The list gets filled in memory until a specified threshold and then gets flushed to disk, clearing the memory which starts getting filled again. This too can be more efficient – with background flushing in another thread that doesn’t interfere with adding elements to the list, but optimizations complicate things and in this case the total running time was not an issue. Most importantly, the iterator() method is overridden to return a custom iterator that first streams the prepended list, then reads everything from disk and finally iterates over the latest batch which is still in memory. And finally, the clear() method should be called in the end in order to close the underlying stream. An output stream could be opened and closed on each flush, but ObjectOutputStream can’t be used in append mode due to some implementation specific about writing headers first.

So basically we hide the streaming approach underneath a List interface – it’s still streaming elements and discarding them when not needed. Ideally this should be done at the source of the data (e.g. a database, message queue, etc.) rather than using the disk as overflow space, but there are cases where using the disk is fine. This implementation is a starting point, as it’s not tested in production, but illustrates that you can adapt existing classes to use different data access patterns if needed.

The post A Disk-Backed ArrayList appeared first on Bozho's tech blog.