Fix Your Crawler

Every now and then I open the admin panel of my blog hosting and ban a few IPs (after I’ve tried messaging their abuse email, if I find one). It is always IPs that are generating tons of requests (and traffic) – most likely running some home-made crawler. In some cases the IPs belong to an actual service that captures and provides content, in other cases it’s just a scraper for unknown reasons. I don’t want to ban IPs, especially because that same IP may be reassigned to a legitimate user (or network) in the future. But they are increasing my hosting usage, which in turn leads to the hosting provider suggesting an upgrade in the plan. And this is not about me, I’m just an example – tons of requests to millions of sites are … useless. My advice (and plea) is this – please fix your crawlers. Or scrapers. Or whatever you prefer to call that thing that programmatically goes on websites and gets their content. How? First, reuse an existing crawler. No need to make something new (unless there’s a very specific use-case). A good intro and comparison can be seen here. Second, make your crawler “polite” (the “politeness” property in the article above). Here’s a good overview on how to be polite, including respect for robots.txt. Existing implementations most likely have politeness options, but you may have to configure them. Here I’d suggest another option – set a dynamic crawl rate per website that depends on how often the content is updated. My blog updates 3 times a month – no need to crawl it...

OWASP Dependency Check Maven Plugin – a Must-Have

I have to admit with a high degree of shame that I didn’t know about the OWASP dependency check maven plugin. And seems to have been around since 2013. And apparently a thousand projects on GitHub are using it already. In the past I’ve gone manually through dependencies to check them against vulnerability databases, or in many cases I was just blissfully ignorant about any vulnerabilities that my dependencies had. The purpose of this post is just that – to recommend the OWASP dependency check maven plugin as a must-have in practically every maven project. (There are dependency-check tools for other build systems as well). When you add the plugin it generates a report. Initially you can go and manually upgrade the problematic dependencies (I upgraded two of those in my current project), or suppress the false positives (e.g. the cassandra library is marked as vulnerable, whereas the actual vulnerability is that Cassandra binds an unauthenticated RMI endpoint, which I’ve addressed via my stack setup, so the library isn’t an issue). Then you can configure a threshold for vulnerabilities and fail the build if new ones appear – either by you adding a vulnerable dependency, or in case a vulnerability is discovered in an existing dependency. All of that is shown in the examples page and is pretty straightforward. I’d suggest adding the plugin immediately, it’s a must-have: <plugin> <groupId>org.owasp</groupId> <artifactId>dependency-check-maven</artifactId> <version>3.0.2</version> <executions> <execution> <goals> <goal>check</goal> </goals> </execution> </executions> </plugin> Now, checking dependencies for vulnerabilities is just one small aspect of having your software secure and it shouldn’t give you a false sense of security (a sort-of “I have...

Using Trusted Timestamping With Java

Trusted timestamping is the process of having a trusted third party (“Time stamping authority”, TSA) certify the time of a given event in electronic form. The EU regulation eIDAS gives these timestamps legal strength – i.e. nobody can dispute the time or the content of the event if it was timestamped. It is applicable to multiple scenarios, including timestamping audit logs. (Note: timestamping is not sufficient for a good audit trail as it does not prevent a malicious actor from deleting the event altogether) There are a number of standards for trusted timestamping, the core one being RFC 3161. As most RFCs it is hard to read. Fortunately for Java users, BouncyCastle implements the standard. Unfortunately, as with most security APIs, working with it is hard, even abysmal. I had to implement it, so I’ll share the code needed to timestamp data. The whole gist can be found here, but I’ll try to explain the main flow. Obviously, there is a lot of code that’s there to simply follow the standard. The BouncyCastle classes are a maze that’s hard to navigate. The main method is obviously timestamp(hash, tsaURL, username, password): public TimestampResponseDto timestamp(byte[] hash, String tsaUrl, String tsaUsername, String tsaPassword) throws IOException { MessageImprint imprint = new MessageImprint(sha512oid, hash); TimeStampReq request = new TimeStampReq(imprint, null, new ASN1Integer(random.nextLong()), ASN1Boolean.TRUE, null); byte[] body = request.getEncoded(); try { byte[] responseBytes = getTSAResponse(body, tsaUrl, tsaUsername, tsaPassword); ASN1StreamParser asn1Sp = new ASN1StreamParser(responseBytes); TimeStampResp tspResp = TimeStampResp.getInstance(asn1Sp.readObject()); TimeStampResponse tsr = new TimeStampResponse(tspResp); checkForErrors(tsaUrl, tsr); // validate communication level attributes (RFC 3161 PKIStatus) tsr.validate(new TimeStampRequest(request)); TimeStampToken token = tsr.getTimeStampToken(); TimestampResponseDto response = new TimestampResponseDto(); response.setTime(getSigningTime(token.getSignedAttributes())); response.setEncodedToken(Base64.getEncoder().encodeToString(token.getEncoded()));...

GDPR – A Practical Guide For Developers

You’ve probably heard about GDPR. The new European data protection regulation that applies practically to everyone. Especially if you are working in a big company, it’s most likely that there’s already a process for gettign your systems in compliance with the regulation. The regulation is basically a law that must be followed in all European countries (but also applies to non-EU companies that have users in the EU). In this particular case, it applies to companies that are not registered in Europe, but are having European customers. So that’s most companies. I will not go into yet another “12 facts about GDPR” or “7 myths about GDPR” posts/whitepapers, as they are often aimed at managers or legal people. Instead, I’ll focus on what GDPR means for developers. Why am I qualified to do that? A few reasons – I was advisor to the deputy prime minister of a EU country, and because of that I’ve been both exposed and myself wrote some legislation. I’m familiar with the “legalese” and how the regulatory framework operates in general. I’m also a privacy advocate and I’ve been writing about GDPR-related stuff in the past, i.e. “before it was cool” (protecting sensitive data, the right to be forgotten). And finally, I’m currently working on a project that (among other things) aims to help with covering some GDPR aspects. I’ll try to be a bit more comprehensive this time and cover as many aspects of the regulation that concern developers as I can. And while developers will mostly be concerned about how the systems they are working on have to change, it’s not unlikely...

The Problem Solver

I’ll start this post with a quote: "Every great developer you know got there by solving problems they were unqualified to solve until they actually did it." – Patrick McKenzie — The Practical Dev (@ThePracticalDev) February 14, 2017 Good developers are good problem solvers. They turn each task into a series of problems they have to solve. They don’t necessarily know how to solve them in advance, but they have their toolbox of approaches, shortcuts and other tricks that lead to the solution. I have outlined one such set of steps for identifying problems, but you can’t easily formalize the problem-solving approach. But is really turning a task into a set of problems a good idea? Programming can be seen as a creative exercise, rather than a problem solving one – you think, you ponder, you deliberate, then you make something out of nothing and it’s beautiful, because it works. And sometimes programming is that, but that is almost always interrupted by a series of problems that stop you from getting the task completed. That process is best visualized with the following short video: That’s because most things in software break. They either break because there are unknowns, or because of a lot of unsuspected edge cases, or because the abstraction that we use leaks, or because the tools that we use are poorly documented or have poor APIs/UIs, or simply because of bugs. Or in many cases – all of the above. So inevitably, we have to learn to solve problems. And solving them quickly and properly is in fact, one might argue, the most important skill when...

I Still Prefer Eclipse Over IntelliJ IDEA

Over the years I’ve observed an inevitable shift from Eclipse to IntelliJ IDEA. Last year they were almost equal in usage, and I have the feeling things are swaying even more towards IDEA. IDEA is like the iPhone of IDEs – its users tell you that “you will feel how much better it is once you get used to it”, “are you STILL using Eclipse??”, “IDEA is so much better, I thought everyone has switched”, etc. I’ve been using mostly Eclipse for the past 12 years, but in some cases I did use IDEA – when I was writing Scala, when I was writing Android, and most recently – when Eclipse failed to be ready for the Java 9 release, so after half a day of trying to get it working, I just switched to IDEA until Eclipse finally gets a working Java 9 version (with Maven and the rest of the stuff). But I will get back to Eclipse again, soon. And I still prefer it. Not just because of all the key combinations I’ve internalized (you can reuse those in IDEA), but because there are still things I find worse in IDEA. Of course, IDEA has so much more cool features like code improvement suggestions and actually working plugins for everything. But at least some of the problems I see have to do with the more basic development workflow and experience. And you can’t compensate for those with sugarcoating. So here they are: Projects are not automatically built (by default), so you can end up with compilation errors that you don’t see until you open a non-compiling...