by Bozho | Aug 27, 2020 | Aggregated, big data, compression, Developer tips
I’d like to share something very brief and very obvious – that compression works better with large amounts of data. That is, if you have to compress 100 sentences you’d better compress them in bulk rather than once sentence at a time. Let me illustrate that: public static void main(String[] args) throws Exception { List<String> sentences = new ArrayList<>(); for (int i = 0; i < 100; i ++) { StringBuilder sentence = new StringBuilder(); for (int j = 0; j < 100; j ++) { sentence.append(RandomStringUtils.randomAlphabetic(10)).append(" "); } sentences.add(sentence.toString()); } byte[] compressed = compress(StringUtils.join(sentences, ". ")); System.out.println(compressed.length); System.out.println(sentences.stream().collect(Collectors.summingInt(sentence -> compress(sentence).length))); } The compress method is using commons-compress to easily generate results for multiple compression algorithms: public static byte[] compress(String str) { if (str == null || str.length() == 0) { return new byte[0]; } ByteArrayOutputStream out = new ByteArrayOutputStream(); try (CompressorOutputStream gzip = new CompressorStreamFactory() .createCompressorOutputStream(CompressorStreamFactory.GZIP, out)) { gzip.write(str.getBytes("UTF-8")); gzip.close(); return out.toByteArray(); } catch (Exception ex) { throw new RuntimeException(ex); } } The results are as follows, in bytes (note that there’s some randomness, so algorithms are not directly comparable): Algorithm Bulk Individual GZIP 6590 10596 LZ4_FRAMED 9214 10900 BZIP2 6663 12451 Why is that an obvious result? Because of the way most compression algorithms work – they look for patterns in the raw data and create a map of those patterns (a very rough description). How is that useful? In big data scenarios where the underlying store supports per-record compression (e.g. a database or search engine), you may save a significant amount of disk space if you bundle multiple records into one stored/indexed record. This is not...
Recent Comments