Databases is one of the few areas in computing where there is a noticeable awareness of models. At least, models are talked about. But actually quite few of the people who work with database systems have more than a very vague view of what data models are or what purpose they serve.
Photo by Marco Arment (some rights reserved).
I explained my view of what a model is in the previous post. A data model then, is a model for dealing with general-purpose data. Data as such is not that common in most computer users' experience these days. People usually have no need to think of the data that carries the information they work with, and systems are kind enough not to bother the users with the low-level concept of data. Only systems that deal with any kind of data, regardless of what it represents, need to include the notion of data in itself in the model they expose to the user. Particularly, data is a concern for database systems – which support intricate operations on general-purpose data. Hence the importance of data models in this context.
The Dominating Data Model
Out of the plethora of data models in use by database systems (some more vaguely defined than others), the most popular one, to the extent that we can skip all the others for the time being, is the relational model. Or perhaps I should say subset of the family of relational models – for there are several. Even the instigator of the term, E.F. Codd, did not present a single coherent model, but seems to have invented a new slightly different variant practically every time he presented it. For this post, however, it will do to to talk about the relational model as if it was one.
An average database professional trying to explain the relational model typically starts by stating that in the relational model, data are stored in tables. Not a good start. First of all, tables do not really belong in the model, they are visualizations of relations – but that is a minor point.
More importantly, the model says nothing of storage! It acts as an interface, for direct users and for surrounding systems that interact with the database. Behind the scenes of the relational model, the system can deal with storage in whichever way it pleases – or whichever way the database administrator configures it. There are endless possibilities for data structures and access methods
Photo by Phil Aaronson (some rights reserved).
The implementers of early database systems, such as System R, chose to use a simple one-to-one mapping between the model and the physical storage. It was a natural choice as a first attempt. What they were doing was trying things out, not creating the ultimate system.
But somehow, the idea of a simple connection between model and storage got stuck in people's minds, and in the major database systems on the market.
Logical–Physical Separation
Recall the inverted text example from a previous post. Here is a variant with the less interesting words removed, and with only the document numbers where the words appear:
Word | Doc. no |
---|---|
art | 1 |
art | 3 |
war | 1 |
war | 2 |
modern | 3 |
peace | 2 |
Many people would say that this is an inefficient representation, because it contains multiple copies of the words that appear in more than one document. To make the representation more compact, they might try using a list of document numbers as the second column, perhaps specified as a “LOB” – a large object.
This is exactly what major database vendors do for their text indexes, which is one of the many choices they make to go against the basic idea of the relational model. Unnecessarily so.
Because, again, the model is not the storage. It is the user interface. The relational structure is not necessarily the same as that in the physical representation. The following figure illustrates how the logical model representation (at the top) can be tied to a compact physical disk storage (bottom).
This is just one possible representation, of course. Ideally, the physical representation should be chosen so that it makes critical operations efficient in the particular application where the database is to be used. The relational model still lets users access data in any way they please, but queries that the chosen implementation is not directly suited for may be execute less efficiently.
A problem with the basic design of major relational database systems is that they do not have enough information to make a good choice of representation. It is impossible for them to obtain the information about the application that they would need.
The Basis of Misunderstanding
People get the wrong idea about the role of the data model partly because they start using systems without learning the basic theory, but another reason, which I consider more critical, is how the user is forced to interact with the system in order to get reasonable performance. (By user in this case, I mean the person who uses the database system for implementing an application.)
In a perfect world, the way it is supposed to work is that the user provides logical definitions of the data, and the system somehow magically chooses the best implementation. This is a tough task, not least because the system must choose the implementation before it has had a chance of getting any feedback as to its use. In practice, this simply does not work.
Hence, the user, who does have an idea about which operations need to be efficient, needs to convey more information to the system than just the logical definitions. (Or move to a specialized database system for the application, if there is one.) And the only way that the user can interact with the system is through the relational interface!
This has two effects. First, database system vendors add physical aspects to the relational interface that are really not relational or logical at all. SQL is full of this – indexes for example. There are no indexes in the relational model; they belong in the implementation.
Second, the user is led into using implicit couplings between the relational definitions and the physical storage for improving performance. Denormalization is a blatant example. Relational theory tells us that that the relational structure of the database should be normalized to avoid integrity problems. Yet, practitioners choose to do the opposite because it can give them a performance advantage.
The result of all this is that it is difficult to spot the actual relational model in the major relational database systems.
Throwing Out the Model
People have started to move away from relational database systems, mainly because of performance issues. A range of more specialized systems for narrower applications is emerging. I have no principle objection to that; using the same database system for everything is no end in itself.
However, I am convinced that there will still be a place for systems that are general-purpose enough to explicitly expose data to the user, without narrowing down what the data may represent, and accept complex specifications on relationships and retrieval operations. For this, a simple logical model that represents data as relations is extremely powerful.
Unfortunately, the relational model has become so closely associated with SQL that people tend to discard the model because they are not satisfied with SQL. The end of the relational era is proclaimed, when it should really be just the end of the SQL era.
I agree with the relational evangelists that the reason why SQL databases are so displeasing is not that they are relational. Rather, one of their shortcomings is that they are not really relational. However, I am not convinced that the third manifesto systems that the true relational lobby brings forward can take their place. The relational model itself does have some drawbacks, which I leave as a subject for future posts.