Migrating data to a new db schema with Django 1.2
This week I had to make a migration of data on a database from an old schema to a new one. The database is part of a Ruby on Rails application, so the changes are part of new features, and we have also taken the opportunity to clean up a little the database and make some changes to be more flexible in the future. As we want to be consistent, we need to migrate all the old data to the new database.
After talking with people with more experience than me on Rails (I have only use it for this project) about how to perform a migration, and as this week the brand new Django version, supporting multiple DBs was release, I decided to use the Django ORM to perform it.
My initial idea about the multiple database support on Django was that each of the models will have some kind of meta information that will determine which database if going to use. So, the idea will be to create models for the old database and models for the new database, each one with its own meta information.
Well, Django doesn’t work exactly this way… You can use that approach if each of the tables on the databases are named differently, because is smart enough to know that a table is only on one database, but the problem was that some of the tables we are using keep the same name, but change the way the information is stored.
In fact, the Django approach it’s more powerful than that, and allow a lot of different techniques, but you have to make some code. The key point is using a ‘router’. A router is a class that, using standardized methods, will return the appropriate database to look when you’re going to read, write, make a relationship or sync the db, according to the model performing the operation. As you can write those methods, you can basically do whatever you can imagine on the databases. For example, write always to the master database and read from a random (or consecutive) slave database. Or write the models of one application on one database and of the other three applications on another.
The router class then is added to the settings.py file. You can even generate several routers and apply them in order.
Getting the models
As the database model hasn’t been designed using Django, but Ruby on Rails, I had to connect to the databases (old and new) and let Django discover the models for me. The easy part is to generate the first models, just using
python manage.py inspectdb --database=DATABASE
Specifying the correct database, and storing each results in two different files, one for the old models and another for the new models (I have called them, in a moment of original inspiration as new_models.py and old_models.py). Then, rename each model to begin with Old or New, so each model name is unique. Then I created a models.py file that will import both, to follow Django conventions. I could also combine both, but having both models as different files seems to make more sense to me.
Then, as you can imagine, the problems began.
First, there is one table that has a composed primary key. Django doesn’t like that, but, as that table is new and doesn’t need the old data, I have just ignored and delete the model.
Another problem is that Rails doesn’t declare relationships as, well, relationships. It doesn’t create the fields as foreign keys on the database, but just as plain integers, and then the code will determine that there are relationships. So, as Django analyze the database, it will determine that the codes are not foreign keys, but plain integers. You have to manually change all those integers to foreign keys to the correct table, if you want to use the ORM properly. Apparently there are some plugins for Rails to define the relationships as foreign keys on the database.
To add a little confusion, Rails declare the names of the tables as plurals (for example, for an model called ‘Country’, the table will be called ‘Countries’), so the name of the models will be plural. I’m used to deal with singular model names in Django, so I tend to use the singular name instead of the plural when using the models, which will raise an error, of course. Anyway, you can avoid it changing the name in the models.
Routing the models
The router is easy, it will just route a model depending on the first letters on the model name. Models beginning with ‘New’ will go to the new database and every other model will go to the old (default) database, both to write and to read. I have started old models with ‘Old’. So the code is like this:
class Router(object): """A router to control all database operations on models in the migration. It will direct any Model beggining with 'New' to the new database and any with 'Old' to the default database""" def db_for_read(self, model, **hints): if model.__name__.startswith('New'): return 'new' return 'default' def db_for_write(self, model, **hints): if model.__name__.startswith('New'): return 'new' return 'default'
To avoid problems, the access to the old database is made with a read-only database user. That will avoid accidentally deleting any data.
The migration script imports the Django settings and then mix all the data from the old database and then generate the new data for the new database. Using the Django ORM is easy, but there are some problems.
Django ORM is slooooooooow. Really really slow, and that’s a bad thing for migrations™ as they usually have lots of stored data. So there are some ideas to keep in mind:
- Raw copy of tables can be performed using raw SQL, so try to avoid just copying from one table in the old database to the same table in the new database using Django, as it can take lots of time, and I mean LOTS. I began copying a table with about 250 thousands records. Time with Django, over 2 hours, time dumping in SQL, about 20 seconds.
- Use manual commits, if the database allows it. It’s not a “magical option”, it’s still slow, but can help.
- As usually the migration will be performed only once, try to work on development with a small subset of the information, or at least try to import one table at a time, and don’t recreate it once it’s on the new database. When you’re happy with your code, you can run it again from the beginning, but it’s awful to wait 5 minutes to realize that you have a typo error on one line, and another 5 minutes to discover the next typo error two lines below that.
Another thing to keep in mind is that the relationships are not shared over databases, so you need to recreate them. For example, imagine that we have this two models, when we store comic book characters. The Publisher table is going to keep the same shape, but the character table will now include the secret identity name.
class NewPublisher(models.Model): id = models.IntegerField(primary_key=True) name = models.CharField(max_length=50) class Meta: db_table = u'publisher' class OldPublisher(models.Model): id = models.IntegerField(primary_key=True) name = models.CharField(max_length=50) class Meta: db_table = u'publisher' class NewCharacter(models.Model): id = models.IntegerField(primary_key=True) publisher = models.ForeignKey('NewPublisher') nickname = models.CharField(max_length=50) secret_identity = models.CharField(max_length=50) class Meta: db_table = u'character' class OldCharacter(models.Model): id = models.IntegerField(primary_key=True) publisher = models.ForeignKey('OldPublisher') name = models.CharField(max_length=50) class Meta: db_table = u'character'
The publisher table is identical, so the private keys are the same. Let’s say that all the secret identities are going to be “Clark Kent”, so the code to migrate the data will be like:
for old_character in OldCharacter.objects.all(): new_character = NewCharacter(nickname=old_character.name, secret_identity='Clark Kent', publisher=NewPublisher.objects.get(pk=old_character.publisher.id)) new_character.save()
You cannot use the relationship directly, and say that publisher = old_character.publisher, because that will try to assign an OldCharacter to a field that should be a NewCharacter. Django will raise an exception, but it’s good to keep that in mind. All those checks will help in the end to have a better control over the data in the new database and will ensure that all the data is consistent.
Migrate data from one database to another is always painful. One can argue that it SHOULD be painful, as you’re dealing with lots of information that should remain in good shape and that should always been taken with respect and caution.
Having that into mind, I must say that Django has made it a little less painful. I think also that the functionality for multi-db support it’s quite powerful and can be adapted to several uses. One thing that has always been more complicated that it should and now can be quite easy is migrating from one kind of database to another (from MySQL to Postgres, for example), keeping the data intact.
Anyway, I still think that including some kind of meta information to specify the database (or database behavior) per model could be a good idea.
But, by far, the worst problem is the way that Django is slow working with large sets of information. Adding some kind of bulk insert will be a huge improvement of the framework. Yes, you can always read the data from the database using the Django ORM and compose the INSERT statements by hand on a file to then load them, which is several orders of magnitude faster, but the key point of using the ORM should be not having to use SQL.
SQLAlchemy has neat features such as a low-level expression language, bulk inserts and the like that might make the migration process a little less painful (though as you say some pain is inevitable).
I have tried a couple of times SQLAlchemy and, while it’s easy to use, I am not really comfortable with its approach. Well, not as much as with Django one, at least. It’s closer to SQL and a relational database than Django ORM, which is more Class-Oriented and help you to not think about the database, but the data.
That’s my opinion, of course.
Anyway, it’s good to know that there are alternatives, and I will take that mind for the future. Do you know any performance comparisons between both systems?
I didn’t mean to replace Django ORM in your application code, but to use it in cases like reporting or migrations, where you need that lower level functionality. Case of best tool for the job.
In terms of performance, the expression language (which is what I’d use in this case; no need for the ORM) translates to almost exactly the SQL you want, so as long as you know your SQL you should be OK. The overhead is quite minimal.