by Bozho | Aug 27, 2020 | Aggregated, big data, compression, Developer tips
I’d like to share something very brief and very obvious – that compression works better with large amounts of data. That is, if you have to compress 100 sentences you’d better compress them in bulk rather than once sentence at a time. Let me illustrate that: public static void main(String[] args) throws Exception { List<String> sentences = new ArrayList<>(); for (int i = 0; i < 100; i ++) { StringBuilder sentence = new StringBuilder(); for (int j = 0; j < 100; j ++) { sentence.append(RandomStringUtils.randomAlphabetic(10)).append(" "); } sentences.add(sentence.toString()); } byte[] compressed = compress(StringUtils.join(sentences, ". ")); System.out.println(compressed.length); System.out.println(sentences.stream().collect(Collectors.summingInt(sentence -> compress(sentence).length))); } The compress method is using commons-compress to easily generate results for multiple compression algorithms: public static byte[] compress(String str) { if (str == null || str.length() == 0) { return new byte[0]; } ByteArrayOutputStream out = new ByteArrayOutputStream(); try (CompressorOutputStream gzip = new CompressorStreamFactory() .createCompressorOutputStream(CompressorStreamFactory.GZIP, out)) { gzip.write(str.getBytes("UTF-8")); gzip.close(); return out.toByteArray(); } catch (Exception ex) { throw new RuntimeException(ex); } } The results are as follows, in bytes (note that there’s some randomness, so algorithms are not directly comparable): Algorithm Bulk Individual GZIP 6590 10596 LZ4_FRAMED 9214 10900 BZIP2 6663 12451 Why is that an obvious result? Because of the way most compression algorithms work – they look for patterns in the raw data and create a map of those patterns (a very rough description). How is that useful? In big data scenarios where the underlying store supports per-record compression (e.g. a database or search engine), you may save a significant amount of disk space if you bundle multiple records into one stored/indexed record. This is not...
by Bozho | Aug 18, 2020 | Aggregated, cybersecurity, Opinions, security
There are hundreds of different information security solutions out there and choosing which one to pick can be hard. Usually decisions are driven by recommendations, vendor familiarity, successful upsells, compliance needs, etc. I’d like to share my understanding of the security landscape by providing one-line descriptions of each of the different categories of products. Note that these categories are not strictly defined sometimes and they may overlap. They may have evolved over time and a certain category can include several products from legacy categories. The explanations will be slightly simplified. For a generalization and summary, skip the list and go to the next paragraph. This post aims to summarize a lot of Gertner and Forester reports, as well as product data sheets, combined with some real world observations and to bring this to a technical level, rather than broad business-focused capabilities. I’ll split them in several groups, though they may be overlapping. Monitoring and auditing SIEM (Security Information and Event Management) – collects logs from all possible sources (applications, OSs, network appliances) and raises alarms if there are anomalies IDS (Intrusion Detection System) – listening to network packets and finding malicious signatures or statistical anomalies. There are multiple ways to listen to the traffic: proxy, port mirroring, network tap, host-based interface listener. Deep packet inspection is sometimes involved, which requires sniffing TLS at the host or terminating it at a proxy in order to be able to inspect encrypted communication (especially for TLS 1.3), effectively doing an MITM “attack” on the organization users. IPS (Intrusion Prevention System) – basically a marketing upgrade of IDS with the added option to...
by Bozho | Jul 30, 2020 | Aggregated, Developer tips, encryption, Opinions
“Encryption” has turned into a buzzword, especially after privacy standards and regulation vaguely mention it and vendors rush to provide “encryption”. But what does it mean in practice? I did a webinar (hosted by my company, LogSentinel) to explain the various aspects and pitfalls of encryption. You can register to watch the webinar here, or view it embedded below: And here are the slides: Of course, encryption is a huge topic, worth a whole course, rather than just a webinar, but I hope I’m providing good starting points. The interesting technique that we employ in our company is “searchable encryption” which allows to have encrypted data and still search in it. There are many more very nice (and sometimes niche) applications of encryption and cryptography in general, as Bruce Schneier mentions in his recent interview. These applications can solve very specific problems with information security and privacy that we face today. We only need to make them mainstream or at least increase awareness. The post Encryption Overview [Webinar] appeared first on Bozho's tech...
by Bozho | Jul 17, 2020 | Aggregated, blockchain, Developer tips, ideas, integration
Blockchain has been a buzzword for the past several years and it hasn’t lived to its promises (yet). The value proposition usually includes vague claims about trust and unmodifiability, but rarely that has brought demonstrable improvement to existing processes. There are dozens of blockchain projects, networks, protocols, “standards”, and all of them can in some way help you solve either data integrity issues (guarantee that data has not been tampered with) or multi-party trust issues (several companies participating in one process shouldn’t have to trust each other in order to have automated cross-organization business processes). However, deploying and integrating a separate blockchain solution is usually a large project in itself and especially in the COVID-19 crisis likely gets postponed because of the questionable return on investment. But for the enterprise, blockchain is largely a shared database. Sharing data with other participants in a given business process in a secure way that doesn’t allow any of the participants to cheat. And this can be achieved not by adding a whole new blockchain infrastructure that would in turn integrate with existing systems (which in many cases can’t be integrated easily because they don’t have APIs), but by “blockchainizing” the existing database. Ideally, what I’m describing should be a project itself, which is either deployed alongside the database, or as part of an application. And what it can do is as follows: Select tables and columns to share with other participants – obviously only parts of the database should be shared with others Define shared data model and data transformations – sometimes data has to be transformed, or masked, in order to...
by Bozho | Jun 21, 2020 | Aggregated, Developer tips, Opinions
If we have to integrate two (or more) systems nowadays, we know – we either use an API or, more rarely, some message queue. Unfortunately, many systems in the world do not support API integration. And many more a being created as we speak, that don’t have APIs. So when you inevitably have to integrate with them, you are left with imperfect choices to make. Below are seven patterns to integrate with legacy systems (or not-so-legacy systems that are built in legacy ways). Initially I wanted to use “bad integration patterns”. But if you don’t have other options, they are not bad – they are inevitable. What’s bad is that fact that so many systems continue to be built without integration in mind. I’ve seen all of these, on more than one occasion. And I’ve heard many more stories about them. Unfortunately, they are not the exception (fortunately, they are also not the rule, at least not anymore). Files on FTP – one application uploads files (XML, CSV, other) to an FTP (or other shared resources) and the other ones reads them via a scheduled job, parses them and optionally spits a response – either in the same FTP, or via email. Sharing files like that is certainly not ideal in terms of integration – you don’t get real-time status of your request, and other aspects are trickier to get right – versioning, high availability, authentication, security, traceability (audit trail). Shared database – two applications sharing the same database may sound like a recipe for disaster, but it’s not uncommon to see it in the wild. If you are...
Recent Comments