Database Migrations And Java

There’s a seemingly trivial problem that every project has – keeping the database structure in sync with the application. That is, when your code relies on a new table or column, that table/column must appear in the database. This is especially a problem when there are multiple environments on which the code gets deployed (production, staging, integration, local), as additional steps usually need to be performed in order to update the database schema.

The quickest way to achieve the desired effect is to provide SQL upgrade scripts every time the code needs a new database element (or no longer needs one). But that’s tedious, as these SQL scripts have to be executed manually on each environment the code gets deployed to. A couple of frameworks automate this process, so you don’t need to do it by hand. Here is a list of options for the Java world.

Why do we actually need such a solution? We aim for frequent, one-click deployments to staging and production environments. We can’t do that without an automated way of syncing the database (otherwise there will be missing columns, or obsolete non-null columns will prevent data insertion). And once in a while we may need to revert a release, and we’d have to roll back the database changes accordingly. While you can just drop and recreate the schema while developing locally, you can’t do that in production.

Rails developers will rightly say “We’ve had that for a long time, out of the box”. Indeed, Rails migrations seem like a perfect implementation of the idea of keeping the application code in sync with the database schema. But I had to pick a technology for a Java project. Actually, it is a Grails project, and even though there appears to be a Grails plugin for that, I picked another option – Liquibase (note: the Grails plugin is built on top of Liquibase). I’ll explain how to get it working with JPA in a separate post, but it seems to have everything Rails offers: a human-readable (XML) format for the upgrades, automatic rollback script execution, and tools that can generate the upgrade scripts based on a database diff.
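
To give an idea of how this looks from the Java side, here is a minimal sketch of running a Liquibase update programmatically at application startup. It assumes the Liquibase 3.x Java API and a changelog file at `db/changelog.xml` on the classpath; the file name, JDBC URL and credentials are placeholders, not from a real project:

```java
import java.sql.Connection;
import java.sql.DriverManager;

import liquibase.Liquibase;
import liquibase.database.Database;
import liquibase.database.DatabaseFactory;
import liquibase.database.jvm.JdbcConnection;
import liquibase.resource.ClassLoaderResourceAccessor;

public class DatabaseMigrator {

    // Applies all pending changesets from db/changelog.xml to the target database.
    public static void migrate(String jdbcUrl, String user, String password) throws Exception {
        try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password)) {
            Database database = DatabaseFactory.getInstance()
                    .findCorrectDatabaseImplementation(new JdbcConnection(connection));
            Liquibase liquibase = new Liquibase(
                    "db/changelog.xml", new ClassLoaderResourceAccessor(), database);
            liquibase.update(""); // empty string = no context filtering
        }
    }
}
```

Typically you’d call something like this from a startup listener (or let the Grails plugin or a servlet listener do it for you), so every environment gets the pending changesets applied on deployment.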

In fact, I’ve never used a Liquibase-like solution for my personal projects. I use Hibernate’s hbm2ddl.auto even in production, and sometimes manually fix stuff. Yes, that’s not a workable solution for a big project and it has risks. Virtually nobody recommends using `hbm2ddl.auto` in production, because it’s not perfect and may mean losing data or downtime. For those unfamiliar with Hibernate, hbm2ddl.auto is a configuration switch that makes Hibernate (an ORM framework) create the tables and columns that correspond to the entity mappings. In a sense, it is a way to automatically sync the application code with the database structure. And it can do that because we have the expected database structure in our annotated POJOs (entities), in a structured way that defines exactly what tables, columns and keys we need. But, as I said, it’s not recommended for production (the recommendation is to set it to “validate”, so that it only verifies whether the database and the code are in sync).
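
To make that concrete, here is a minimal sketch of what such a structured definition looks like – a JPA entity whose annotations spell out the table, columns and keys the code expects. The entity and its fields are illustrative, not from a real project:

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;

// The annotations describe exactly which table, columns and keys the code needs.
@Entity
@Table(name = "users")
public class User {

    @Id
    @GeneratedValue
    private Long id;

    @Column(name = "email", nullable = false, unique = true, length = 255)
    private String email;

    // getters and setters omitted
}
```

The schema-handling behaviour is then controlled by the `hibernate.hbm2ddl.auto` property (in persistence.xml, hibernate.cfg.xml or wherever your Hibernate properties live): `create` and `update` let Hibernate modify the schema itself, while `validate` – the production-safe option – only checks that the database matches the mappings.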

But I’m an idealist. I want something even better than a good migrations tool like Liquibase. We have the prerequisite – a structured way for our code to define the schema it requires: the JPA annotations (with some vendor extensions). We just need a tool that works as if it had a manually created SQL upgrade script. It needs to:

  • create tables, columns, indexes, foreign keys
  • drop tables, columns, indexes, foreign keys
  • alter tables, columns, indexes, foreign keys
  • fix the data before/after the structural change
  • make sure no data is lost

hbm2ddl.auto does only the first, and can partly do the fourth. It would be relatively easy to implement the “drop” and “alter” functionality. How do we make sure no data is lost? Whenever a column or table is dropped, a new temporary table (e.g. prefixed with _hbm_temp) is created that stores the data from that column/table. If it turns out the data should not have been deleted, you can get it back; otherwise, you can clean up all the temp tables once in a while. I can’t see any obvious problem with that approach, and it would save a lot of time, effort and mistakes. A rough sketch of the idea follows below.
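
Such a tool doesn’t exist (yet), so the sketch below just illustrates the kind of plain JDBC steps it would perform when dropping a column. It assumes a primary key column named `id` and a database that supports `CREATE TABLE ... AS SELECT` (e.g. MySQL or PostgreSQL); table and column names are hypothetical:

```java
import java.sql.Connection;
import java.sql.Statement;

public class SafeColumnDrop {

    // Copies the to-be-dropped column into a temp table before altering the schema,
    // so the data can be recovered if the drop turns out to be a mistake.
    public static void dropColumnSafely(Connection connection,
                                        String table, String column) throws Exception {
        String tempTable = "_hbm_temp_" + table + "_" + column;
        try (Statement stmt = connection.createStatement()) {
            // 1. preserve the data (primary key plus the column being dropped)
            stmt.executeUpdate("CREATE TABLE " + tempTable
                    + " AS SELECT id, " + column + " FROM " + table);
            // 2. perform the actual structural change
            stmt.executeUpdate("ALTER TABLE " + table + " DROP COLUMN " + column);
        }
    }
}
```

A periodic cleanup job could then drop any `_hbm_temp_*` tables older than, say, a few releases.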

While on the topic, I can’t skip the main reason for this complexity – the fact that relational databases have a rigid schema. NoSQL databases, on the other hand, are mostly schema-less. That eliminates the exceptions arising from missing columns or tables – they automatically “appear” in the database. However, unused fields stay there and pollute the database, and you have to clean them up manually. And the lack of database-imposed constraints means they have to be enforced in the application code. So, there’s a tradeoff (as usual).

Overall, avoid having to execute SQL scripts manually in order to fix the database. Always aim for the application to be the central point, and make everything else (the schema) follow. That isn’t always possible (e.g. if you need database triggers and stored procedures, or if multiple applications use the same database but can’t share a codebase), but at least put these scripts in the same repository as the code and make it possible to deploy quickly to any environment, with as few manual steps as possible.
