Tuesday, March 17, 2009

The Relational Model Is Not Logical

by Jesper Larsson

Look up the relational model on Wikipedia, and the first sentence that hits you (in the current version at least) is “The relational model for database management is a database model based on first-order predicate logic, first formulated and proposed in 1969 by Edgar Codd.” It would be nice if that were true. If the dominant model for databases had really taken its starting point in logic, the database field might not be such a mess today.

Although there is a clear connection between logic and the relational model, the assertion that the model is based on logic is at the very least debatable.

Relational model opening scrollArt by ctail (some rights reserved).

Logic and Relations

The brilliant idea at the core of the relational model is to use the logical/mathematical concept of relations as the single concept to describe data. In other models, there are typically at least two concepts: entities (or objects or whatever) and relationships (or links, connections, etc.). But in the relational model, one concept is enough.

Let us think about a typical real-life example of something that can be modeled as a relation: fatherhood. Fatherhood is a relation between two domains of objects: the set of all possible fathers and the set of all people. If we name the set of fathers F, and use X to denote an element of F (i.e., some father); and similarly call the set of people P and an element of it Y, then we can express the relation something like this:

X father_of Y

Mathematicians and logicians usually prefer single-letter symbols, and mostly write in prefix notation, so let us shorten father_of to f and write:

f(X, Y).

We call f a relation symbol, following Hodges1.

There is a point with using single-letter symbols instead of descriptive names, apart from saving space. It stresses that the symbols, as far as their use in logic is concerned, are really just that: abstract symbols that follow formal logical rules that we can define in whatever way we see fit. The meaning of the symbols is really beyond the scope of the logic. Connections between logical statements and the outside world can only be made through interpretations – attaching real-world concepts to the symbols.

For instance, we can take the statement “X is the father of Y” as the meaning of relation symbol f with arguments X and Y. Such a statement, used to attach a meaning to a relation symbol, is called a predicate. (At least according to some authors. There are other uses of the word predicate, a common one being to use predicate more or less in place of relation symbol.)

A more direct way to correlate the relation symbol f with the real world, without relying on any meaning to be evaluated by human judgment, is to provide the exact values for which f evaluates to true. This set of values is a subset of the Cartesian product F × P of the two sets F and P, and an interpretation of f can be given in the form of such a subset of F × P – a set of ordered pairs of values for which f evaluates to true.

For instance, one interpretation could be:

{ (anakin, leia), (george, george_w), (frank, dweezil) }.

There you have an explanation for the common way to define the term relation in mathematics simply as a subset of a Cartesian product.

A basic idea of the relational model is that sets such as these – relations, or extensions of the corresponding predicate – should be the only concept used for representing all data in the database, although as we shall see, it commonly is not.

The values anakin, leia, george, etc. are supposed to be objects from the “real world”, but when displaying the value like this we obviously have to use some sort of representative (in the form of character strings) rather than the actual objects (people). It is to break out of traps like this one that texts on logic sometimes contain peculiar assertions like “constants are interpreted as themselves.”

The parenthood relation is a binary, or 2-ary, relation; and its elements are pairs, or 2-tuples. In general, a relation symbol can be n-ary for any natural number n. For instance, we can write

p(X, Y, Z)

and let p be interpreted by, for instance, the predicate “X is a person who was born in city Y in the year X”, or by the relation

{ (george, milton, 1924),
(george_w, new_haven, 1946),
(frank, baltimore, 1940),
(dweezil, los_angeles, 1969) }.

This demonstrates the elegant uniform-concept idea of representing data as relations. In a model that has both entities and relationships, the person with birth city and year data would typically be perceived as an entity, while parenthood would be a relationship between records of the person entity. In the relational model, both sets of data are on the same basic form: relation. A fatherhood relation between two person object domains, and a person relation among the three domains person name, city, and year.

Where is the logic?
Photo by Peter Lindberg (some rights reserved).

Non-Logical Data

A convenient way of visualizing a relation, common in association with the relational database model, is to use a table:

parent_of
father (possible_father)child (person)
anakinleia
georgegeorge_w
frankdweezil

In the relational database model, the heading (schema), which is comprised of a name and data type for each column in the table (or in an alternative terminology, for each attribute in the relation). These metadata are not percieved as actual data in the relational database model, even though it is something that the user of the database system is expected to provide.

In addition, relational database systems typically comprise constraints of one kind of another. These are also supplied in ad hoc, non-relational, ways.

So, the relational database model breaks its own principle, that all data is represented in the form of relations, by introducing several additional forms of data. Classical logic finds no need for these special forms.

Algebra vs Logic

In his original account of the relational model,2, 3 Codd did briefly suggests that a data language based on first order predicate calculus should be developed, but, unfortunately, that did not make much of an impression on the database community. The paper also contained the embryo of another data language: relational algebra, a set of “operations on relations” which was later modified and expanded, and inspired various database languages. The languages grew more and more ad hoc and complex, and distanced from the ideas of logic. During the 1970s, the database language SQL was developed, with at least some involvement from Codd, and today SQL in various forms is the king of illogical relational languages.

Obviously, many people have noted the complexity and overhead problems of the relational systems in use today. Unfortunately, many draw the incorrect conclusion that the problems are in the relational model as such, rather than the implementations being stuck in old ways, partly caused by inadequacies buried in the SQL language. Hence, they choose to revert back to pre-relational tree-, network-, or record-oriented schemes.

Some however – particularly C. J. Date and Hugh Darwen, sanctioned by the whining but brilliant grump Fabian Pascal – have made heroic attempts to bring the relational model back to its roots. They have defined their own relational language dubbed “D”, which throws out the worst atrocities of SQL, and strengthens the declarative aspect.

They do not, however, go all the way to the elegance and simplicity of classical logic. (Although they have, as an aside in some of the Third Manifesto writings, taken the first step towards doing so with the alternative algebra “A”.) That is a pity, because expressing data in pure logical form makes for an excellent model. In logic, there is nothing special with concepts such as data types, constraints, comparisons, or arithmetic operations. They are just special cases of the basic simple constructs. Arguably, logic can provide the simplest model possible that allows defining and accessing databases, avoiding a lot of the awkwardness that comes from the directional operations of relational algebra.

Others have realized this, of course. As early as in the 1970s there was substantial research on the connection between logic and databases. Look back, and you will find pioneer work by, among others, Jack Minker and Raymond Reiter, and a lively research community. But for some reason, this was mostly not considered database research, but classified as artificial intelligence or logic programming, and database people appear to have taken little notice. The field more or less died out in the late 1990s, without having come through to the industry or database community.

Simplicity Enables Efficiency

Taking a logical approach to databases is not just an exercise in mathematical aesthetics. One of the most important points is simplifying the interface to database management core, in keeping the interface to higher levels of the system small. This takes a substantial workload off the shoulders of the system implementer.

In particular, a logical description of operations simplifies formulation of optimization tasks of the database system. Query optimization for SQL, with all its complexity, quirks, and inconsistencies, is a nightmare. Furthermore, for various reasons (which I hope to get back to in a future post), the actual operation of database applications using SQL systems are mostly not expressed in the language of the database at all. This takes the job of capable overall optimization from the level of practically impossible to theoretically impossible.

Optimization of database operations on the basis of pure logic, on the other hand, allows the system to concentrate on the actual computational complexity issues of translating logical formulae into execution plans. Not that this makes it easy, but arguably it makes it as simple as theoretically possible.

Revival?

The Claremont Report on Database Research, a document that has come out of a meeting of a number of distinguished database specialists in May 2008, mentions that “new declarative languages, often grounded in Datalog, have recently been developed for a variety of domain-specific systems.” Datalog is a Prolog-like language for database access, which dominated the database logic research during the 1980s, so this statement could be taken as an indication that a revival for logic in database access is on the horizon. I suppose our platform, Apptus Theca 2.0, can be said to be one such system, although our current interface is based on a graphical user interface rather than a language in text form.

I have not, however, come into direct contact with any such projects apart from our own, neither academic or commercial. If you are involved in, or aware of, anything like this, I would love to hear from you.

References

1 Wilfrid Hodges, Classical Logic I: First-Order Logic, The Blackwell Guide to Philosophical Logic (Lou Goble, ed.), Blackwell Publishers, 2001.
2 Edgar F. Codd, Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks, IBM Research Report, 1969.
3 Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks, Communications of the ACM 13 (1970), no 6, 377–387.