People really love RDBMSs
I had this discussion with a friend while helping him with a personal project: software to manage a weekly schedule. It has some teachers, each with a profile, some students, each with a profile, and classes, which relate the teachers to the students with a time and a physical place.
My friend has worked a lot with RDBMSs as a DBA, mostly with Oracle and MySQL, and he is quite good at it. He began designing a relational database model, with tables for students, teachers, places, etc… He also wanted to learn Python, so he decided to access the database with SQLAlchemy and installed a MySQL database for development.
This development is intended for a small academy, so it will run on a single computer. There is no need for a client-server architecture or any concurrency. It is just one application you start, load the data into, change, and then close, storing the data for the next time you need to open it again.
So, basically, what you have here is a set of classes defining objects. Those objects, holding all the information, would be stored in a MySQL database through SQLAlchemy, and accessed only through the SQLAlchemy interface.
To me it’s clearly overdesign.
What is the point of storing the data in an RDBMS? You don’t need any of the good stuff a relational database brings. You don’t need transactions, as your data is only accessed by one single application. You don’t need to share the data over a network. You don’t have to standardize the data to be accessed from different languages or clients, or build a clustered database… Sure, relational databases are great and are used in lots of applications, but in this particular one you’re not getting any real advantage from using one. Instead, you’re creating a data model using classes and objects in Python. We only want that information to be persistent, so we can close the application, open it again, and have the same data.
The most appropriate tool in this case, for me, is plain serialization. Just gather all the data into one object and pickle it to a file before closing the application. If we need extra care in case of error, do it each time something changes. If the data is expected to be huge (not in this case), you can split it across different files; the same goes if you expect a lot of searching.
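A minimal sketch of that approach, using only the standard library (the file name and the shape of the data are made up for illustration):

```python
import pickle

DATA_FILE = 'schedule.pickle'  # hypothetical file name

def load_data():
    """Load the whole data model, or start fresh if there is no file yet."""
    try:
        with open(DATA_FILE, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {'teachers': [], 'students': [], 'classes': []}

def save_data(data):
    """Serialize the whole data model before closing the application."""
    with open(DATA_FILE, 'wb') as f:
        pickle.dump(data, f)
```

The application just calls `load_data()` at startup and `save_data()` on exit (or after every change, for the extra-careful variant).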
You can argue that, well, the application can grow, and be distributed, and THEN you can use all the fancy MySQL features. And that’s true. So you need to define a data model that can THEN be adapted using SQLAlchemy or another ORM (I really like the Django one). It’s really not so difficult. But complicating your system in advance is something I like to avoid.
I like to think that we should design a data model first, and then (and not before) think about how to implement that data model in a particular way (memory, files, a database, etc.). Of course, a relational database is often the right solution, but it’s not the only one available.
EDITED: After some comments (which I really appreciate), I have to say that I can be quite easily convinced to use SQLite. My position is more “I think that pickling the data is enough, but if you feel more comfortable, use SQLite; I really think that MySQL is too much”. The key idea of the post is that we should think, even for a few seconds, about whether an RDBMS is appropriate in every design, and consider alternatives… I think that a lot of designs begin with MySQL (or worse, with Oracle or MS SQL Server) before even thinking about the data model… The DB should support the data model, not the data model be made to fit the DB…
A real DB will also maintain consistency a bit better than pickle files. Having to explain fsync, for example, to non-computer people is painful. Otherwise I totally agree!
Although overdesign is always a danger, I think an RDBMS is appropriate here. It sounds like the data model includes relational data, and it’s always nice if you can store data in a manner that’s more universally accessible than via Python.
I’d be more concerned if the developer had no RDBMS experience, but that doesn’t seem to be the problem here. Although I do agree that MySQL seems like overkill when SQLite would do the trick.
I think that, other than during development and testing, in this particular case no one will ever manage the data other than from the application.
I also suggested SQLite, as a way to have an RDBMS, but a small one… According to the specs of the project, I think it’s not necessary, but I can be convinced to use it 😛 . After all, it’s easy to set up from an ORM, while MySQL is complete overkill…
Anyway, I think the key point, beyond this particular problem, is not to be restricted to always using the same tools, and to think about the data model, which will not always mean a big RDBMS…
Or you could still use SQLAlchemy coupled with SQLite.
That way, if you ever need to grow over to a beefier RDBMS, you just need to tune the SQLAlchemy layer, not your app’s whole persistence layer.
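A minimal sketch of that setup, assuming SQLAlchemy 1.4 or later (the `Teacher` model and the file name are invented for illustration):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Teacher(Base):
    # Hypothetical model, just for illustration.
    __tablename__ = 'teachers'
    id = Column(Integer, primary_key=True)
    name = Column(String)

# A single file on disk; to move to MySQL later, only this URL
# (plus data migration) needs to change, not the model classes.
engine = create_engine('sqlite:///schedule.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(Teacher(name='Ms. Smith'))
session.commit()
```

Swapping `sqlite:///schedule.db` for a `mysql://` URL is the whole migration story at the code level.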
Remember to keep an eye on future expansion, even when unlikely.
Of course, don’t design a Space Shuttle for yet another simple CRUD either.
You have to keep in mind that you’ll most likely need to back up the data at some point. RDBMSs of all sizes already come with some kind of simple querying tool that lets you interact with the data without using your client app.
All that being said, pickle rocks.
Backing up a pickle file is just making a copy of the file… You can integrate it with a “general” backup of the computer.
Another vote for sqlalchemy with sqlite, but to keep it really simple use Elixir (http://elixir.ematia.de/trac/wiki). Elixir is a wrapper for sqlalchemy that simplifies all the legwork for small projects. Just read the one-page tutorial on their site and see how really simple it is to map your classes to database tables, create the db schema etc.
I’m not affiliated with Elixir in any way, but having used the package in at least three simple projects, I have to tell the world… It’s the only thing I need to get a small project quickly off the ground.
Pickle is a catastrophically bad data persistence format. Ignore the fact that it’s slow and insecure, and you’re still left with the problem that it’s *extremely* fragile and that the failure mode is for all your data to be unrecoverable without manual intervention.
It sounds like you’re recommending something you’ve never actually tried in a serious application yourself. If you *have* done this and been successful, then you should probably *also* be writing about all of the little tricks you had to learn in order to be successful at it, since these are far from obvious.
Well, maybe we’re talking about different things, in that I cannot consider this application a “serious” one, as it is very simple and performs non-critical operations. Of course, a regular backup is something you have to do to avoid complete loss of data.
Have I used this approach in something? Yes, I’ve done that, for example storing pickled information in a spreadsheet using pyUNO. The application was a macro in OpenOffice that would read some parameters from the spreadsheet and present a series of dialogs for entering data; that data was then stored and signed, to be loaded later into the central database. In this case we wanted to avoid storing information in different files, as we wanted the whole “application” self-contained in the spreadsheet, so we pickled the data and stored it in a sheet. The data was protected and signed to avoid inconsistencies (our problem, as it’s not public information, is avoiding inconsistencies). We did have some problems storing the pickled data: some of it could not be directly pickled (custom dynamic classes), so we had to do some “custom” pickling, and of course the cells of a spreadsheet have a limited size, so we had to split the pickled data across a lot of cells…
I’ve also used it on a small personal project that tried to automatically filter RSS links using a Bayesian filter (interesting/not interesting). To store the filter (consisting of words, the probability of each appearing in an interesting link, and other data such as the number of training uses, etc.) I began with an SQLite database, but it turned out to be really slow at retrieving the information, so I built a big dictionary with all the words and their associated info and stored the filter in a file. The time was drastically reduced. Anyway, the filter didn’t perform as well as I imagined (I think that kind of filter is good for spam, but too naive for this intelligent processing), so I abandoned the project after a while…
(I think I’m learning a lot on this post, anyway…)
I certainly do agree, but I have found that using BDB is far easier than plain pickle, and you get finely tuned and robust data storage. And it’s well supported in Python. The downside is that the documentation is sparse and somewhat lacking when you’re new, but there are plenty of projects using it.
You’re referring to the Berkeley database, aren’t you?
Do you know a good tutorial? I’m interested in this approach…
I started from here, but it also took me a couple of the bare C tutorials and the reference to get the hang of it: http://pybsddb.sourceforge.net/reftoc.html
SQLite + SQLAlchemy seems like a much “safer” route. Even if you were to go the route of pickle and flat files, as has been pointed out, your data integrity would be much less assured. That is, unless you were to write a lot of boilerplate code.
Also, as pointed out, this gives you upward mobility if it ever does become more than a single-computer, single-user thing.
A little update to the post after the comments; maybe I wasn’t clear 😉
Also, thanks everyone for your comments!
As others have mentioned, SQLite + SQLAlchemy is probably a good mix as you can switch RDBMS systems transparently (well, minus data migration) if you ever need to.
Another option to look at would be ZODB. Up until about a month ago, my blood reacted violently to any Python with a “Z” in it.
I took a bit of time to learn how to use ZODB without the Zope framework, and it’s a surprisingly easy to use and clean system. I view it as the SQLite of the NoSQL world, if that makes sense. If you ever need to share it, you can simply export it using ZEO and you’re good to go.
He’ll also get a taste of the Zope world, which really makes sense to understand if he’ll be getting into Python (at least on some level).
I didn’t know about ZODB; I’ll take a look at it.
Don’t overlook the advantages of SQL as an ad hoc query language. With a relational database it will be possible to do reasonable data investigations that would have to be hand-crafted on the kind of back-end you are considering.
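For example, even a small SQLite database lets you answer one-off questions directly, with no application code (table and column names here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # a file path in the real case
conn.execute('CREATE TABLE classes (teacher TEXT, weekday TEXT)')
conn.executemany('INSERT INTO classes VALUES (?, ?)',
                 [('Smith', 'mon'), ('Smith', 'tue'), ('Jones', 'mon')])

# An ad hoc investigation: how many classes does each teacher give?
rows = conn.execute(
    'SELECT teacher, COUNT(*) FROM classes GROUP BY teacher ORDER BY teacher'
).fetchall()
```

With pickle files, that same question means writing a small program every time.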
Most of the time there is also a hidden customer requirement to be able to get at their data if they don’t want to use the application anymore. That’s why I mostly prefer text-based file formats for storage or export. This way the data is easily available and can be used in another application. File formats like CSV, and objects saved as JSON, are very useful for this purpose.
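A sketch of that kind of text-based export, using only the standard library (the records and field names are invented):

```python
import csv
import json

# Hypothetical records; any data you can flatten works.
teachers = [{'name': 'Ms. Smith', 'subject': 'Math'},
            {'name': 'Mr. Jones', 'subject': 'History'}]

# CSV: opens directly in a spreadsheet or any other application.
with open('teachers.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'subject'])
    writer.writeheader()
    writer.writerows(teachers)

# JSON: keeps nesting and basic types, still human-readable.
with open('teachers.json', 'w') as f:
    json.dump(teachers, f, indent=2)
```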
If I remember correctly it is also a pain to upgrade/modify your python classes if you have pickled them to a file.
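That upgrade pain can be softened by giving pickled classes a `__setstate__` hook that patches up state written by older versions; a sketch (the `Profile` class and its fields are invented):

```python
import pickle

class Profile:
    VERSION = 2

    def __init__(self, name, email=''):
        self.version = Profile.VERSION
        self.name = name
        self.email = email  # field that did not exist in version 1

    def __setstate__(self, state):
        # pickle calls this with the stored __dict__; fill in
        # anything an older version of the class did not have.
        if state.get('version', 1) < 2:
            state.setdefault('email', '')
            state['version'] = 2
        self.__dict__.update(state)

# Simulate restoring an object pickled by the old class:
old = Profile.__new__(Profile)                   # pickle bypasses __init__ too
old.__setstate__({'version': 1, 'name': 'Ann'})  # state from an old pickle
```

It works, but it is exactly the kind of hand-written migration code a schema-based store would handle more explicitly.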
Yes, definitely, storing information in plain text is very useful, and the csv module is very easy to use.
RDBMS is a hammer. It is a good one and can even nail in most screws.
connection = RDBMS.connect(connection_info_and_stuff)
cursor = connection.cursor()
cursor.execute('SELECT * FROM teachers')
for entry in cursor.fetchall():
    do_stuff(entry)
# parameterized, to avoid quoting and injection problems
cursor.execute('SELECT * FROM courses WHERE course_id = %s', (course_id,))
course_info = cursor.fetchone()
( Speed: 300-400 operations per second )
db = anydbm.open('database.db', 'c')
for key in db.keys():
    do_stuff(db[key])
some_info = db[course_id]
( Speed: 2000-3000 operations per second )
JSON, pickle, parsed CSV:
db = dict((row[0], row) for row in
          (line.strip().split(',') for line in open('data.csv')))
for (key, entry) in db.iteritems():
    do_stuff(entry)
some_info = db[course_id]
( Speed: up to 50 000 operations per second )
So – decide for yourself.