Thursday, October 30, 2008

Models and Efficiency, part 2: Databases

by Jesper Larsson

Databases are one of the few areas in computing where there is a noticeable awareness of models. At least, models are talked about. But in fact, quite few of the people who work with database systems have more than a vague view of what data models are or what purpose they serve.

Photo by Marco Arment (some rights reserved).

I explained my view of what a model is in the previous post. A data model, then, is a model for dealing with general-purpose data. Data as such is not that common in most computer users' experience these days. People usually have no need to think of the data that carries the information they work with, and systems are kind enough not to bother users with the low-level concept of data. Only systems that deal with any kind of data, regardless of what it represents, need to include the notion of data itself in the model they expose to the user. In particular, data is a concern for database systems – which support intricate operations on general-purpose data. Hence the importance of data models in this context.

The Dominating Data Model

Out of the plethora of data models in use by database systems (some more vaguely defined than others), the most popular one, to the extent that we can skip all the others for the time being, is the relational model. Or perhaps I should say a subset of the family of relational models – for there are several. Even the instigator of the term, E.F. Codd, did not present a single coherent model, but seems to have invented a slightly different variant practically every time he presented it. For this post, however, it will do to talk about the relational model as if it were one.

An average database professional trying to explain the relational model typically starts by stating that in the relational model, data are stored in tables. Not a good start. First of all, tables do not really belong in the model; they are visualizations of relations – but that is a minor point.

More importantly, the model says nothing about storage! It acts as an interface, both for direct users and for surrounding systems that interact with the database. Behind the scenes of the relational model, the system can deal with storage in whichever way it pleases – or whichever way the database administrator configures it. There are endless possibilities for data structures and access methods.

Photo by Phil Aaronson (some rights reserved).

The implementers of early database systems, such as System R, chose to use a simple one-to-one mapping between the model and the physical storage. It was a natural choice as a first attempt. What they were doing was trying things out, not creating the ultimate system.

But somehow, the idea of a simple connection between model and storage got stuck in people's minds, and in the major database systems on the market.

Logical–Physical Separation

Recall the inverted text example from a previous post. Here is a variant with the less interesting words removed, and with only the document numbers where the words appear:

Word    Doc. no
art     1
art     3
war     1
war     2
modern  3
peace   2

Many people would say that this is an inefficient representation, because it contains multiple copies of the words that appear in more than one document. To make the representation more compact, they might try using a list of document numbers as the second column, perhaps specified as a “LOB” – a large object.

This is exactly what major database vendors do for their text indexes – one of the many choices they make that go against the basic idea of the relational model. Unnecessarily so.

Because, again, the model is not the storage. It is the user interface. The relational structure need not be mirrored in the physical representation. The following figure illustrates how the logical model representation (at the top) can be tied to compact physical storage on disk (bottom).
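The same idea can be sketched in a few lines of Python (names and layout invented for illustration): the stored form keeps each word once, with a packed list of document numbers, while a small access function presents it as the logical relation of (word, doc. no) pairs.

```python
# Physical storage (an invented packed layout): each word appears once,
# together with the list of documents that contain it.
physical = {
    "art":    [1, 3],
    "war":    [1, 2],
    "modern": [3],
    "peace":  [2],
}

def rows():
    """Expose the packed storage as the logical relation (word, doc_no)."""
    for word, docs in physical.items():
        for doc in docs:
            yield (word, doc)

# Users see only the relation; the compact layout stays behind the scenes.
print(sorted(rows()))
# [('art', 1), ('art', 3), ('modern', 3), ('peace', 2), ('war', 1), ('war', 2)]
```

Any query against the relation can be answered from the packed form; only its efficiency profile depends on the chosen layout.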

This is just one possible representation, of course. Ideally, the physical representation should be chosen so that it makes critical operations efficient in the particular application where the database is to be used. The relational model still lets users access data in any way they please, but queries that the chosen implementation is not directly suited for may execute less efficiently.

A problem with the basic design of major relational database systems is that they do not have enough information to make a good choice of representation. It is impossible for them to obtain the information about the application that they would need.

The Basis of Misunderstanding

People get the wrong idea about the role of the data model partly because they start using systems without learning the basic theory, but another reason, which I consider more critical, is how the user is forced to interact with the system in order to get reasonable performance. (By user in this case, I mean the person who uses the database system for implementing an application.)

In a perfect world, the way it is supposed to work is that the user provides logical definitions of the data, and the system somehow magically chooses the best implementation. This is a tough task, not least because the system must choose the implementation before it has had a chance to get any feedback on how the data is used. In practice, this simply does not work.

Hence, the user, who does have an idea about which operations need to be efficient, needs to convey more information to the system than just the logical definitions. (Or move to a specialized database system for the application, if there is one.) And the only way that the user can interact with the system is through the relational interface!

This has two effects. First, database system vendors add physical aspects to the relational interface that are really not relational or logical at all. SQL is full of this – indexes for example. There are no indexes in the relational model; they belong in the implementation.
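A small demonstration of the point, sketched with Python's built-in SQLite bindings (table and index names invented): creating an index changes nothing at the logical level – the same query returns the same rows before and after.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inverted (word TEXT, doc_no INTEGER)")
con.executemany("INSERT INTO inverted VALUES (?, ?)",
                [("art", 1), ("art", 3), ("war", 1),
                 ("war", 2), ("modern", 3), ("peace", 2)])

query = "SELECT doc_no FROM inverted WHERE word = 'war' ORDER BY doc_no"
before = con.execute(query).fetchall()

# The index is a purely physical artifact: it may change how rows are
# found, never which rows are found.
con.execute("CREATE INDEX idx_word ON inverted (word)")
after = con.execute(query).fetchall()

print(before == after)  # True
```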

Second, the user is led into exploiting implicit couplings between the relational definitions and the physical storage to improve performance. Denormalization is a blatant example. Relational theory tells us that the relational structure of the database should be normalized to avoid integrity problems. Yet practitioners choose to do the opposite because it can give them a performance advantage.
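To make the trade-off concrete, here is a sketch (SQLite again, schema invented) of the integrity problem that denormalization invites: repeating a document's title in every index row saves a join, but nothing stops the copies from drifting apart when one of them is updated.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE documents (doc_no INTEGER PRIMARY KEY, title TEXT);
    -- Denormalized: the title is duplicated into every index row.
    CREATE TABLE inverted_denorm (word TEXT, doc_no INTEGER, title TEXT);
""")
con.execute("INSERT INTO documents VALUES (1, 'Art of War')")
con.executemany("INSERT INTO inverted_denorm VALUES (?, ?, ?)",
                [("art", 1, "Art of War"), ("war", 1, "Art of War")])

# Someone updates the title in one place only...
con.execute("UPDATE documents SET title = 'The Art of War' WHERE doc_no = 1")

titles = {t for (t,) in con.execute(
    "SELECT title FROM inverted_denorm WHERE doc_no = 1")}
master = con.execute(
    "SELECT title FROM documents WHERE doc_no = 1").fetchone()[0]
print(master in titles)  # False: the duplicated copies are now stale
```

In a normalized design the title would exist in one place only, and the anomaly could not arise.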

The result of all this is that it is difficult to spot the actual relational model in the major relational database systems.

Throwing Out the Model

People have started to move away from relational database systems, mainly because of performance issues. A range of more specialized systems for narrower applications is emerging. I have no objection in principle to that; using the same database system for everything is not an end in itself.

However, I am convinced that there will still be a place for systems that are general-purpose enough to explicitly expose data to the user, without narrowing down what the data may represent, and accept complex specifications on relationships and retrieval operations. For this, a simple logical model that represents data as relations is extremely powerful.

Unfortunately, the relational model has become so closely associated with SQL that people tend to discard the model because they are not satisfied with SQL. The end of the relational era is proclaimed, when it should really be just the end of the SQL era.

I agree with the relational evangelists that the reason why SQL databases are so displeasing is not that they are relational. Rather, one of their shortcomings is that they are not really relational. However, I am not convinced that the Third Manifesto systems that the true relational lobby brings forward can take their place. The relational model itself does have some drawbacks, which I leave as a subject for future posts.

3 comments:

ericjs said...

I agree with most of what you say here. However there is one point I want to make in response to:

A problem with the basic design of major relational database systems is that they do not have enough information to make a good choice of representation. It is impossible for them to obtain the information about the application that they would need.

I don't see any reason why relational databases can't be self-optimizing, based on usage. Sun's HotSpot compiler does this: patterns of usage allow it to learn how to optimize your Java bytecode.

An RDBMS ought to be able to gather information about the patterns of queries it repeatedly sees, and how long they take, and take measures to speed them up. It could, for example, create what are essentially denormalized structures, even what are essentially materialized views. We shouldn't even have to create indexes manually; there's no good reason the system shouldn't be able to figure out where indexes are needed.
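A toy version of this idea (all names invented, using SQLite for concreteness) might simply count repeated query shapes and create an index once a pattern recurs often enough:

```python
import sqlite3

class AutoIndexer:
    """Hypothetical sketch: track query shapes, index hot columns."""

    def __init__(self, con, threshold=3):
        self.con, self.threshold = con, threshold
        self.counts = {}  # (table, column) -> times queried

    def query(self, table, column, value):
        key = (table, column)
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] == self.threshold:
            # The pattern is hot: create an index behind the user's back.
            self.con.execute(
                f"CREATE INDEX IF NOT EXISTS auto_{table}_{column} "
                f"ON {table} ({column})")
        return self.con.execute(
            f"SELECT * FROM {table} WHERE {column} = ?", (value,)).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE inverted (word TEXT, doc_no INTEGER)")
con.executemany("INSERT INTO inverted VALUES (?, ?)",
                [("art", 1), ("war", 1), ("war", 2)])
db = AutoIndexer(con)
for _ in range(3):
    db.query("inverted", "word", "war")  # third call triggers the index

names = [row[1] for row in con.execute("PRAGMA index_list('inverted')")]
print("auto_inverted_word" in names)  # True
```

A real system would of course weigh maintenance cost, measure query times, and drop indexes that stop paying off.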

And yes I realize the mess that is the SQL language would make this harder than it needs to be, but you would think after all the decades these systems have been around, at least a little progress would have been made in this direction.

Jesper Larsson said...

ericjs: That is a series of good points: an RDBMS should be capable of optimization based on statistics; SQL makes it hard; but at least some progress should be possible.

Then the question is why this does not happen. Even though vendors have been saying for many years that their systems are becoming more autonomic, we have seen few practical results. It could be that their system design makes it more difficult than it should be, even beyond the problems with SQL. Or perhaps economic strategy tells them it is not worth going at this with full force.

A factor against radical self-optimization may be how projects are typically deployed. People want a system to be in a stable state when it goes into production. It feels risky to have the system introduce new indexing etc. on the fly, and the feeling is not unfounded. Statistical self-optimization could still be useful during development of an application, but to really make a difference it has to do better than a human DBA could easily do. I think we will see a breakthrough when that happens, but I am not sure it will be in systems like today's SQL DBMSs.

But now I am mostly speculating…

ericjs said...

I think you're right that internal design issues and marketing concerns add to the difficulty.

As far as internal design goes, I did keep my examples to things that essentially simulate what users have to do to optimize manually, though SQL interpretation is one design area that might still be troublesome. Even deeper optimizations around the actual storage of table data would be possible with deeper design changes, but that opens bigger cans of worms in terms of current designs.

But I think there is a deeper problem. RDBMS culture has gotten, well, old and stale. It has a musty smell about it that keeps a lot of new talent from getting into the field or even from investing very heavily as potential users of it. This is of course a vicious cycle, and has nothing to do with the relational model itself.

The whole job category of DBA is a good example. No other field of computing is structured like this. The pattern of having dedicated DBAs manage databases is analogous to the old days of having computer operators do many of the things that today the operating system does automatically. It is certainly not in these people's interest for this to change, and they are the major consumers and purchase advisers of RDBMSes.

This, I think, is a major reason why so much energy gets invested in entirely new database approaches rather than into improving the RDBMS landscape.