Look up the relational model on Wikipedia, and the first sentence that hits you (in the current version at least) is “The relational model for database management is a database model based on first-order predicate logic, first formulated and proposed in 1969 by Edgar Codd.” It would be nice if that were true. If the dominant model for databases had really taken its starting point in logic, the database field might not be such a mess today.

Although there is a clear connection between logic and the relational model, the assertion that the model is

*based*on logic is at the very least debatable.

Art by ctail (some rights reserved).

**Logic and Relations**

The brilliant idea at the core of the relational model is to use the logical/mathematical concept of relations as the single concept to describe data. In other models, there are typically at least two concepts:

*entities*(or objects or whatever) and

*relationships*(or links, connections, etc.). But in the relational model, one concept is enough.

Let us think about a typical real-life example of something that can be modeled as a relation:

*fatherhood*. Fatherhood is a relation between two domains of objects: the set of all possible fathers and the set of all people. If we name the set of fathers F, and use X to denote an element of F (i.e., some father); and similarly call the set of people P and an element of it Y, then we can express the relation something like this:

X father_of Y

Mathematicians and logicians usually prefer single-letter symbols, and mostly write in prefix notation, so let us shorten father_of to f and write:

f(X, Y).

We call f a

*relation symbol*, following Hodges

^{1}.

There is a point with using single-letter symbols instead of descriptive names, apart from saving space. It stresses that the symbols, as far as their use in logic is concerned, are really just that: abstract symbols that follow formal logical rules that we can define in whatever way we see fit. The

*meaning*of the symbols is really beyond the scope of the logic. Connections between logical statements and the outside world can only be made through

*interpretations*– attaching real-world concepts to the symbols.

For instance, we can take the statement “X is the father of Y” as the meaning of relation symbol f with arguments X and Y. Such a statement, used to attach a meaning to a relation symbol, is called a

*predicate*. (At least according to some authors. There are other uses of the word predicate, a common one being to use predicate more or less in place of relation symbol.)

A more direct way to correlate the relation symbol f with the real world, without relying on any

*meaning*to be evaluated by human judgment, is to provide the exact values for which f evaluates to true. This set of values is a subset of the Cartesian product F × P of the two sets F and P, and an interpretation of f can be given in the form of such a subset of F × P – a set of ordered pairs of values for which f evaluates to true.

For instance, one interpretation could be:

{ (anakin, leia), (george, george_w), (frank, dweezil) }.

There you have an explanation for the common way to define the term relation in mathematics simply as a subset of a Cartesian product.

A basic idea of the relational model is that sets such as these – relations, or

*extensions*of the corresponding predicate – should be the only concept used for representing all data in the database, although as we shall see, it commonly is not.

The values

*anakin*,

*leia*,

*george*, etc. are supposed to be objects from the “real world”, but when displaying the value like this we obviously have to use some sort of representative (in the form of character strings) rather than the actual objects (people). It is to break out of traps like this one that texts on logic sometimes contain peculiar assertions like “constants are interpreted as themselves.”

The parenthood relation is a binary, or

*2-ary*, relation; and its elements are pairs, or

*2-tuples*. In general, a relation symbol can be n-ary for any natural number n. For instance, we can write

p(X, Y, Z)

and let p be interpreted by, for instance, the predicate “X is a person who was born in city Y in the year X”, or by the relation

{ | (george, milton, 1924), |

(george_w, new_haven, 1946), | |

(frank, baltimore, 1940), | |

(dweezil, los_angeles, 1969) }. |

This demonstrates the elegant uniform-concept idea of representing data as relations. In a model that has both entities and relationships, the

*person with birth city and year*data would typically be perceived as an entity, while

*parenthood*would be a relationship between records of the person entity. In the relational model, both sets of data are on the same basic form: relation. A fatherhood relation between two person object domains, and a person relation among the three domains

*person name*,

*city*, and

*year*.

Photo by Peter Lindberg (some rights reserved).

**Non-Logical Data**

A convenient way of visualizing a relation, common in association with the relational database model, is to use a table:

father (possible_father) | child (person) |
---|---|

anakin | leia |

george | george_w |

frank | dweezil |

In the relational database model, the heading (schema), which is comprised of a

*name*and

*data type*for each column in the table (or in an alternative terminology, for each attribute in the relation). These

*metadata*are not percieved as actual data in the relational database model, even though it is something that the user of the database system is expected to provide.

In addition, relational database systems typically comprise

*constraints*of one kind of another. These are also supplied in ad hoc, non-relational, ways.

So, the relational database model breaks its own principle, that all data is represented in the form of relations, by introducing several additional forms of data. Classical logic finds no need for these special forms.

**Algebra vs Logic**

In his original account of the relational model,

^{2, 3}Codd did briefly suggests that a data language based on first order predicate calculus should be developed, but, unfortunately, that did not make much of an impression on the database community. The paper also contained the embryo of another data language:

*relational algebra*, a set of “operations on relations” which was later modified and expanded, and inspired various database languages. The languages grew more and more ad hoc and complex, and distanced from the ideas of logic. During the 1970s, the database language SQL was developed, with at least some involvement from Codd, and today SQL in various forms is the king of illogical relational languages.

Obviously, many people have noted the complexity and overhead problems of the relational systems in use today. Unfortunately, many draw the incorrect conclusion that the problems are in the relational model as such, rather than the implementations being stuck in old ways, partly caused by inadequacies buried in the SQL language. Hence, they choose to revert back to pre-relational tree-, network-, or record-oriented schemes.

Some however – particularly C. J. Date and Hugh Darwen, sanctioned by the whining but brilliant grump Fabian Pascal – have made heroic attempts to bring the relational model back to its roots. They have defined their own relational language dubbed “D”, which throws out the worst atrocities of SQL, and strengthens the declarative aspect.

They do not, however, go all the way to the elegance and simplicity of classical logic. (Although they have, as an aside in some of the Third Manifesto writings, taken the first step towards doing so with the alternative algebra “A”.) That is a pity, because expressing data in pure logical form makes for an excellent model. In logic, there is nothing special with concepts such as data types, constraints, comparisons, or arithmetic operations. They are just special cases of the basic simple constructs. Arguably, logic can provide the simplest model possible that allows defining and accessing databases, avoiding a lot of the awkwardness that comes from the directional

*operations*of relational algebra.

Others have realized this, of course. As early as in the 1970s there was substantial research on the connection between logic and databases. Look back, and you will find pioneer work by, among others, Jack Minker and Raymond Reiter, and a lively research community. But for some reason, this was mostly not considered database research, but classified as

*artificial intelligence*or

*logic programming*, and database people appear to have taken little notice. The field more or less died out in the late 1990s, without having come through to the industry or database community.

**Simplicity Enables Efficiency**

Taking a logical approach to databases is not just an exercise in mathematical aesthetics. One of the most important points is simplifying the interface to database management core, in keeping the interface to higher levels of the system small. This takes a substantial workload off the shoulders of the system implementer.

In particular, a logical description of operations simplifies formulation of

*optimization*tasks of the database system. Query optimization for SQL, with all its complexity, quirks, and inconsistencies, is a nightmare. Furthermore, for various reasons (which I hope to get back to in a future post), the actual operation of database applications using SQL systems are mostly not expressed in the language of the database at all. This takes the job of capable overall optimization from the level of practically impossible to

*theoretically*impossible.

Optimization of database operations on the basis of pure logic, on the other hand, allows the system to concentrate on the actual computational complexity issues of translating logical formulae into execution plans. Not that this makes it easy, but arguably it makes it as simple as theoretically possible.

**Revival?**

The Claremont Report on Database Research, a document that has come out of a meeting of a number of distinguished database specialists in May 2008, mentions that “new declarative languages, often grounded in Datalog, have recently been developed for a variety of domain-specific systems.” Datalog is a Prolog-like language for database access, which dominated the database logic research during the 1980s, so this statement could be taken as an indication that a revival for logic in database access is on the horizon. I suppose our platform, Apptus Theca 2.0, can be said to be one such system, although our current interface is based on a graphical user interface rather than a language in text form.

I have not, however, come into direct contact with any such projects apart from our own, neither academic or commercial. If you are involved in, or aware of, anything like this, I would love to hear from you.

**References**

**1**Wilfrid Hodges,

*Classical Logic I: First-Order Logic*, The Blackwell Guide to Philosophical Logic (Lou Goble, ed.), Blackwell Publishers, 2001.

**2**Edgar F. Codd,

*Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks*, IBM Research Report, 1969.

**3**Edgar F. Codd,

*A Relational Model of Data for Large Shared Data Banks*, Communications of the ACM 13 (1970), no 6, 377–387.

## 2 comments:

Dear Jesper,

I stumbled upon your blog because I'm currently investigating how to efficiently find duplicate code in a large code base, and you have apparently written a PhD thesis on roughly that topic, which I have opened but not yet read, although it looks very readable.

But the sentence you quote from Wikipedia: "The relational model for database management is a database model based on first-order predicate logic, first formulated and proposed in 1969 by Edgar Codd." happens to be one I wrote, so it's hard for me to resist commenting on this piece.

I like your piece. It is well written and it reflects some ideas that are apparently shared by many. However, it also contains some remarks I feel are wrong.

First, I agree with you that the connection between relational databases and logic doesn't seem to be as clear as it could be. Before I added that sentence, the Wikipedia article didn't actually explain the connection at all. But I did attempt to add such an explanation, and

you appear to have missed it.

In logic, we distinguish between a set of logical formulas (e.g. formulas in first-order predicate logic over a finite set of predicate variables, each with a certain arity) and interpretations of such sets of formulas (e.g. stuctures to which the formulas can be applied). An interpretation is a model of a set of formulas if all of the formulas hold in that interpretation. In the relational model, database schemas are the signatures of predicate variables (relvars); constraints are first-order formulas of predicate logic over those variables; the possible database instances (i.e. valid contents of a database of a given schema with the given constraints) are exactly the finite models of these formulas. So the connection is quite a bit closer than you suggest. And constraints are logical formulas (but rules for their resolution, such as "ON DELETE CASCADE", are something else).

Second, I feel you are glossing over a fundamental issue in database theory by following Date & al. in their blind denouncement of references. It is not inherently illogical to allow references such as "the person named Jesper Larsson" into our logic: we can extend logic to capture such references; Richard Montague did this in the 70s.

Third, I don't understand your denouncement of relational algebra.

Finally, I think that, like Date, you're too harsh on SQL. It really isn't so far from predicate logic as you suggest.

There is more, but if you have more time, please focus on that followup you planned. I'm really looking forward to it.

Sorry for not responding to comments lately. The reason is, as you might guess from the lack of new post, that this blog has gotten left behind in my work as my company has moved off in new directions. But I hope to publish at least one post soon, to round things off.

Another thing: there has been an enormous amount of comments on this blog lately, most of which were obvious spam, but since the commenting is moderated, I am not letting them trough. For some comments, however, it is difficult to determine if they are real comments or not. There has been a number of comments containing nonspecific praise that was obviously bot-generated. (The motive eludes me – there are usually no links in the comments.) Thus, it is possible that I have rejected or deleted some comments that were actually real comments and not spam, just because they looked a bit like these vaguely praising spam comments. If so, I apologize. I may disable commenting completely if spamming continues to accelerate.

Post a Comment