The Low Quality of Scientific Code

Recently I’ve been trying to get a bit into music theory, machine learning, and computational linguistics, so I ended up looking at libraries and tools written by the scientific community – examples include the Stanford CoreNLP library, GATE, Weka, jMusic, and several more.

The general feeling is that scientific libraries have mostly bad code. I will not point fingers, but there are too many freshman mistakes: ignoring thread-safety; cryptic, ugly and/or stringly-typed APIs; lack of type-safety; poorly named variables and methods; bad or slow serialization formats; debug messages written to System.err (or System.out); lack of documentation; lack of tests.
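To make two of these mistakes concrete, here is a minimal sketch of a stringly-typed method next to a typed one, with a logger in place of System.err. All names in it (TaggerSketch, PartOfSpeech, isNoun) are hypothetical and made up for illustration; they are not taken from any of the libraries mentioned above.

```java
import java.util.logging.Logger;

public class TaggerSketch {
    // Hypothetical example class; not code from any real library.
    private static final Logger LOG = Logger.getLogger(TaggerSketch.class.getName());

    /** Typed alternative to passing "noun" / "verb" strings around. */
    public enum PartOfSpeech { NOUN, VERB, OTHER }

    // Stringly-typed style: a typo such as "nuon" compiles fine
    // and only shows up as a wrong answer at runtime.
    public static boolean isNounStringly(String tag) {
        return "noun".equals(tag);
    }

    // Typed style: the compiler rejects anything that is not a PartOfSpeech.
    public static boolean isNoun(PartOfSpeech tag) {
        // Debug output goes through a logger that callers can configure or
        // silence, instead of being printed unconditionally to System.err.
        LOG.fine(() -> "checking tag " + tag);
        return tag == PartOfSpeech.NOUN;
    }

    public static void main(String[] args) {
        System.out.println(isNounStringly("nuon"));    // prints "false", the typo goes unnoticed
        System.out.println(isNoun(PartOfSpeech.NOUN)); // prints "true"
    }
}
```

The point is not that enums and loggers are sophisticated, but that they are cheap, and a library user immediately feels their absence.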

Thus using these libraries becomes time-consuming and error-prone. Every 10 minutes you see some horribly written code that you don’t have the time to fix. And it’s not just one or two things that you would report in a normal open-source project – it’s an overall low level of quality. On the other hand, these libraries have a lot of value, because the low-level algorithms would take even more time, and especially know-how, to implement yourself, so just reusing them is obviously the right approach. Some libraries are even original research, so you just can’t write them yourself without spending 3 years on a PhD thesis.

I cannot help but mention Heartbleed here – OpenSSL is written by scientific people, and much has been written on the topic that even OpenSSL does not meet modern software engineering standards.

But that’s only the surface. Scientists in general can’t write good code. They write code simply to achieve their immediate goal, and then either throw it away or keep using it for themselves. They are not software engineers, and they don’t seem to be concerned with code quality, code coverage, or API design. Not to mention scientific infrastructure: deployment on multiple servers, managing environments. These things are rarely done properly in the scientific community.

And that’s not only in computer science and related fields like computational linguistics – it’s everywhere, because every science now requires at least computer simulations. Biology, bioinformatics, astronomy, physics, chemistry, medicine, etc. – almost every scientist has to write code. And they aren’t good at it.

And that’s OK – we are software engineers and we dedicate our time and effort to these things; they are scientists, and they have vast knowledge in their domain. Scientists use programming the way software engineers use public transport – just as a means to get to what they have to do. And scientists should not be distracted from their domain by becoming software engineers.

But the problem is still there. Not only are there bad libraries, but the code scientists write may yield wrong results, run slowly, or crash regularly, which directly slows down or even invisibly hampers their work.

For the libraries, we software engineers can contribute, or companies using them can dedicate an engineer to improving the library. Refactor, clean up, document, test. The authors of the libraries will be more than glad to have someone prettify their hairy code.

The other problem is tougher – science needs funding for dedicated software engineers, and research teams prefer to use that funding for actual scientists. Maybe that’s a better investment, maybe not. I can say for myself that I’d be glad to join a research team and help with the software part, while at the same time gaining knowledge in the field. That would be fascinating, and way more exciting than writing boring business software. Unfortunately that doesn’t happen too often now (I tried once, a couple of years ago, and got rejected because I lacked formal education in biology).

Maybe software engineers can help in the world of science. But money is a factor.
