Resources on Distributed Hash Tables

Distributed p2p technologies have always fascinated me. BitTorrent is cool not because you can download pirated content for free, but because it’s an amazing piece of technology. At some point I read and researched a lot about how DHTs (distributed hash tables) work. DHTs are not part of the original BitTorrent protocol, but after trackers came increasingly under threat of being shut down for copyright infringement, “trackerless” features were added to the protocol. A DHT is distributed among all peers and holds information about which peer holds what data. Once you are connected to a peer, you can query it for what it knows about who has what. During my research (which had no particular purpose) I took note of many resources that I thought useful for understanding how DHTs work and possibly implementing something on top of them in the future. In fact, a DHT is a “shared database”, “just like” a blockchain. You can’t trust it as much, but proving digital events does not require a blockchain anyway. My point here is – there is a lot more cool stuff to distributed / p2p systems than blockchain. And maybe way more practical stuff. It’s important to note that the DHT used in BitTorrent is Kademlia. You’ll see a lot about it below. Anyway, the point of this post is to share the resources that I collected. For my own reference and for everyone who wants to start somewhere on the topic of DHTs. BitTorrent DHT protocol – a nice explanation of how the DHT is used in BitTorrent (here’s a list of all BitTorrent protocol enhancements) Kademlia: design...
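To make the “who has what” lookup a bit more concrete, here is a minimal Python sketch of the XOR distance metric that Kademlia is built around. It is not BitTorrent’s actual wire protocol: the helper names (node_id, xor_distance, closest_nodes) are made up for illustration, and deriving IDs from a label with SHA-1 is an assumption used only to get 160-bit values.

```python
import hashlib


def node_id(label: str) -> int:
    """Derive a 160-bit ID from a label (illustrative only)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: the bitwise XOR of two IDs, read as an integer."""
    return a ^ b


def closest_nodes(known_ids, target, k=3):
    """Return the k known IDs that are closest to the target under XOR distance."""
    return sorted(known_ids, key=lambda n: xor_distance(n, target))[:k]


if __name__ == "__main__":
    # Hypothetical peers and a hypothetical key we want to locate.
    peers = [node_id(f"peer-{i}") for i in range(20)]
    key = node_id("some-torrent")
    for peer in closest_nodes(peers, key):
        print(f"{peer:040x}")
```

A lookup repeatedly asks the closest peers it knows about for peers that are even closer to the key, so each round narrows in on the nodes responsible for that key.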

Algorithmic and Technological Transparency

Today I gave a talk at OpenFest about algorithmic and technological transparency. It was a somewhat abstract and high-level talk but I hoped to make it more meaningful than if some random guy just spoke about techno-dystopias. I don’t really like these kinds of talks where someone who has no idea what a “training set” is, how dirty the data is, or how to write five lines of Python talks about an AI revolution. But I think being a technical person lets me put some details into the high-level stuff and make it less useless. The problem I’m trying to address is opaque systems – we don’t know how recommendation systems work, why we are seeing certain videos or ads, how decisions are made, what happens with our data and how secure the systems we use are, including our cars and future self-driving cars. Algorithms have a real impact on society – recommendation engines make conspiracy theories mainstream, motivate fascists, create echo chambers. Algorithms decide whether we get loans, welfare, a conviction, or even if we get hit by a car (as in the classical trolley problem). How do we make the systems more transparent, though? There are many approaches, some overlapping. Textual description of how things operate, tracking the actions of back-office admins and making them provable to third parties, implementing “Why am I seeing this?”, tracking each step in machine learning algorithms, publishing datasets, publishing source code (as I’ve previously discussed open-sourcing some aspects of self-driving cars). We don’t have to reveal trade secrets and lose competitive advantage in order to be transparent. Transparency doesn’t have to come at...

Automate Access Control for User-Specific Entities

Practically every web application is supposed to have multiple users and each user has some data – posts, documents, messages, whatever. And the most obvious thing to do is to protect these entities from being obtained by users that are not the rightful owners of these resources. Unfortunately, this is not the easiest thing to do. I don’t mean it’s hard, it’s just not as intuitive as simply returning the resources. When you are writing your /record/{recordId} endpoint, a database query for the recordId is the immediate thing you do. Only then comes the concern of checking whether this record belongs to the currently authenticated user. Frameworks don’t give you a hand here, because this access control and ownership logic is domain-specific. There’s no obvious generic way to define the ownership. It depends on the entity model and the relationships between entities. In some cases it can be pretty complex, involving a lookup in a join table (for many-to-many relationships). But you should automate this, for two reasons. First, manually doing these checks on every endpoint/controller method is tedious and makes the code ugly. Second, it’s easy to forget to add these checks, especially if there are new developers. You can do these checks in several places, all the way to the DAO, but in general you should fail as early as possible, so these checks should be at the controller (endpoint handler) level. In the case of Java and Spring, you can use annotations and a HandlerInterceptor to automate this. In case of any other language or framework, there are similar approaches available – some pluggable way to describe...
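The post itself talks about Java, Spring annotations and a HandlerInterceptor. As a framework-agnostic illustration of the same idea, here is a minimal Python sketch that uses a decorator in place of an annotation plus interceptor. Everything in it is hypothetical: the RECORDS dictionary stands in for a database table, and verify_owner, record_owner, get_record and Forbidden are names made up for this example.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised when the current user does not own the requested entity."""


# Hypothetical in-memory data standing in for a database table.
RECORDS = {
    1: {"owner_id": "alice", "text": "Alice's record"},
    2: {"owner_id": "bob", "text": "Bob's record"},
}


def record_owner(record_id):
    """Domain-specific ownership lookup: return the owner's ID, or None if missing."""
    record = RECORDS.get(record_id)
    return record["owner_id"] if record else None


def verify_owner(owner_lookup, id_param):
    """Run the ownership check before the handler body is reached."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(current_user, **params):
            owner = owner_lookup(params[id_param])
            # Fail early: unknown entities and foreign entities are both rejected.
            if owner is None or owner != current_user:
                raise Forbidden(f"{id_param}={params[id_param]}")
            return handler(current_user, **params)
        return wrapper
    return decorator


@verify_owner(record_owner, "record_id")
def get_record(current_user, record_id):
    """Handler for a /record/{recordId}-style endpoint; contains no ownership logic."""
    return RECORDS[record_id]
```

With this in place, get_record("alice", record_id=1) returns Alice’s record, while get_record("alice", record_id=2) raises Forbidden before the handler body runs. The ownership rule lives in one lookup function per entity type, so a new endpoint only needs the decorator.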

Random App Ideas

Every now and then you start thinking “wouldn’t it be nice to have an app for X”. Whenever I was in that situation, I took a note. Then that note grew, I cleaned up some absurd ideas and added some more. I implemented some of these ideas of mine, the rest formed a “TODO” list. At some point I realized I won’t be able to implement them, as the time needed is not something I’m likely to have in the near future. But I didn’t want to just delete the notes. So here they are – my random app ideas. Probably useless, but maybe a little interesting. Receipts via smartphone NFC – there are apps that let you track your expenses, but they are tedious. There are apps that try to OCR receipts, but receipts vary too much in different parts of the world for that to be consistent (there are companies like Receipt Bank that do something like that, but it’s not the way I’d like this problem solved in the long run). So I thought stores can offer the option for NFC receipts – you just tap your phone to a device (or another phone) and get the receipt in an electronic form. Not a picture, but the raw data – items, prices, taxes, issuer. Then you have the option to print it if you like. Of course I realized at some point that legislation would first have to allow for that – in many cases you must issue a paper receipt and the digital one is not “legal”. But the idea still remains viable and probably not hard...

Scaling Horizontally on AWS [talk]

At a recent conference (HackConf) I gave a talk where I tried to summarize how to do deployment and horizontal scaling on AWS. It is an overview of AWS resources (instances, load balancers, auto-scaling groups, security groups) as well as how to use CloudFormation to script your stack. It briefly mentions the application layer and what it should look like (because another talk at the same conference was focused on that part). My point here is summarized as: “You cannot scale an unscalable application”. But the talk continues to discuss AWS-specific things, although many of them have nearly identical counterparts in other IaaS providers (e.g. Google Cloud, Azure). The video of the talk can be seen here: And the slides are here: As someone summarized on twitter: “That talk is approximately a year worth of learning experience with AWS in 40 minutes”. This is a benefit and a drawback, as it might be too condensed and too shallow, but I think I’ve covered important bits with enough depth for a starting point. One of my points was that for simpler setups you don’t need fancy tools and platforms (docker, kubernetes, etc.) – as you’ll have to use bash anyway, you can go with just bash + CloudFormation and have a perfectly good, highly-available, blue-green deployment setup. The other main points were: “think about your infrastructure as code”, and “consider all your resources dispensable, as they will surely die at some point”. Overall, I hope the talk is useful for everyone using or planning to use AWS, or any other IaaS provider.