Thursday, October 30, 2008

Models and Efficiency, part 2: Databases

by Jesper Larsson

Databases form one of the few areas in computing where there is a noticeable awareness of models. At least, models are talked about. But surprisingly few of the people who work with database systems have more than a vague idea of what data models are or what purpose they serve.

Photo by Marco Arment (some rights reserved).

I explained my view of what a model is in the previous post. A data model then, is a model for dealing with general-purpose data. Data as such is not that common in most computer users' experience these days. People usually have no need to think of the data that carries the information they work with, and systems are kind enough not to bother the users with the low-level concept of data. Only systems that deal with any kind of data, regardless of what it represents, need to include the notion of data in itself in the model they expose to the user. Particularly, data is a concern for database systems – which support intricate operations on general-purpose data. Hence the importance of data models in this context.

The Dominating Data Model

Out of the plethora of data models in use by database systems (some more vaguely defined than others), the most popular one, to the extent that we can skip all the others for the time being, is the relational model. Or perhaps I should say a subset of the family of relational models – for there are several. Even the instigator of the term, E.F. Codd, did not present a single coherent model, but seems to have invented a slightly different variant practically every time he presented it. For this post, however, it will do to talk about the relational model as if it were one.

An average database professional trying to explain the relational model typically starts by stating that in the relational model, data are stored in tables. Not a good start. First of all, tables do not really belong in the model, they are visualizations of relations – but that is a minor point.

More importantly, the model says nothing about storage! It acts as an interface, both for direct users and for surrounding systems that interact with the database. Behind the scenes of the relational model, the system can deal with storage in whichever way it pleases – or whichever way the database administrator configures it. There are endless possibilities for data structures and access methods.

Photo by Phil Aaronson (some rights reserved).

The implementers of early database systems, such as System R, chose to use a simple one-to-one mapping between the model and the physical storage. It was a natural choice as a first attempt. What they were doing was trying things out, not creating the ultimate system.

But somehow, the idea of a simple connection between model and storage got stuck in people's minds, and in the major database systems on the market.

Logical–Physical Separation

Recall the inverted text example from a previous post. Here is a variant with the less interesting words removed, and with only the document numbers where the words appear:

Word Doc. no
art 1
art 3
war 1
war 2
modern 3
peace 2

Many people would say that this is an inefficient representation, because it contains multiple copies of the words that appear in more than one document. To make the representation more compact, they might try using a list of document numbers as the second column, perhaps specified as a “LOB” – a large object.
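To make the trade-off concrete, here is a small sketch in Java (the data layout and method name are mine, for illustration) of deriving the packed word-to-document-list form from the one-row-per-pair form:

```java
import java.util.*;

public class WordIndexShapes {
    // Derive the packed word -> document-list form from
    // one-row-per-(word, document) pairs.
    static Map<String, List<Integer>> pack(List<String[]> rows) {
        Map<String, List<Integer>> packed = new TreeMap<>();
        for (String[] row : rows) {
            packed.computeIfAbsent(row[0], k -> new ArrayList<>())
                  .add(Integer.parseInt(row[1]));
        }
        return packed;
    }

    public static void main(String[] args) {
        // The relational view: one row per (word, document number) pair.
        List<String[]> rows = List.of(
                new String[]{"art", "1"}, new String[]{"art", "3"},
                new String[]{"war", "1"}, new String[]{"war", "2"},
                new String[]{"modern", "3"}, new String[]{"peace", "2"});
        // The compact alternative a LOB column would hold:
        System.out.println(pack(rows)); // {art=[1, 3], modern=[3], peace=[2], war=[1, 2]}
    }
}
```

The point, as argued below, is that this packing is a storage decision, and nothing forces it to surface in the logical model.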

This is exactly what major database vendors do for their text indexes, which is one of the many choices they make to go against the basic idea of the relational model. Unnecessarily so.

Because, again, the model is not the storage. It is the user interface. The relational structure is not necessarily the same as that in the physical representation. The following figure illustrates how the logical model representation (at the top) can be tied to a compact physical disk storage (bottom).

This is just one possible representation, of course. Ideally, the physical representation should be chosen so that it makes critical operations efficient in the particular application where the database is to be used. The relational model still lets users access data in any way they please, but queries that the chosen implementation is not directly suited for may execute less efficiently.

A problem with the basic design of major relational database systems is that they do not have enough information to make a good choice of representation. It is impossible for them to obtain the information about the application that they would need.

The Basis of Misunderstanding

People get the wrong idea about the role of the data model partly because they start using systems without learning the basic theory, but another reason, which I consider more critical, is how the user is forced to interact with the system in order to get reasonable performance. (By user in this case, I mean the person who uses the database system for implementing an application.)

In a perfect world, the way it is supposed to work is that the user provides logical definitions of the data, and the system somehow magically chooses the best implementation. This is a tough task, not least because the system must choose the implementation before it has had a chance to get any feedback on how the data will be used. In practice, this simply does not work.

Hence, the user, who does have an idea about which operations need to be efficient, needs to convey more information to the system than just the logical definitions. (Or move to a specialized database system for the application, if there is one.) And the only way that the user can interact with the system is through the relational interface!

This has two effects. First, database system vendors add physical aspects to the relational interface that are really not relational or logical at all. SQL is full of this – indexes for example. There are no indexes in the relational model; they belong in the implementation.

Second, the user is led into exploiting implicit couplings between the relational definitions and the physical storage to improve performance. Denormalization is a blatant example. Relational theory tells us that the relational structure of the database should be normalized to avoid integrity problems. Yet, practitioners choose to do the opposite because it can give them a performance advantage.

The result of all this is that it is difficult to spot the actual relational model in the major relational database systems.

Throwing Out the Model

People have started to move away from relational database systems, mainly because of performance issues. A range of more specialized systems for narrower applications is emerging. I have no objection in principle to that; using the same database system for everything is not an end in itself.

However, I am convinced that there will still be a place for systems that are general-purpose enough to explicitly expose data to the user, without narrowing down what the data may represent, and accept complex specifications on relationships and retrieval operations. For this, a simple logical model that represents data as relations is extremely powerful.

Unfortunately, the relational model has become so closely associated with SQL that people tend to discard the model because they are not satisfied with SQL. The end of the relational era is proclaimed, when it should really be just the end of the SQL era.

I agree with the relational evangelists that the reason why SQL databases are so displeasing is not that they are relational. Rather, one of their shortcomings is that they are not really relational. However, I am not convinced that the Third Manifesto systems that the true relational lobby brings forward can take their place. The relational model itself does have some drawbacks, which I leave as a subject for future posts.

Friday, October 24, 2008

Practical Stuff: Models, Theories, Abstractions

by Jesper Larsson

They are usually mentioned only in a few fields, but models are actually everywhere in computing. In fact, the model is the key concept that lets computers do things that people find interesting.

The word model can mean many things, some of which are shown in the collage below. I am not talking about any of those. Nor am I talking about the technical use in logic (an assignment of constants to variables that makes a sentence true), despite the close tie between logic and computing – particularly databases.

Models: photos by Nguyễn Thành Lam, Erik Hersman, Richard Smith, MaZzuk (some rights reserved).

No, I am talking about a model in the sense of an abstraction. A simplified representation of something. A framework for naming and envisioning non-physical objects to allow reasoning about them and shaping theories about them. Often by use of metaphors – analogous concepts picked from a well known domain.

The picture below is an example of a sort of abstract model that most computer users encounter. It shows a set of folders, or files, which in turn are the contents of another folder. But of course, they are not really folders. Folders are made of paper and kept in file cabinets. These are just images on the screen that symbolize something that we think about as analogous to real folders.

The physical constructs beneath the abstraction are not something the average user needs to be concerned with. In fact, the folders can, and in this case do, represent quite different things. Some of the folders in the picture are kept on my local hard drive, but some are network connections. Since, from a certain aspect, they all support the same set of operations, we can interact with them using the simplified view that they are a homogeneous set of objects. That is part of the model of this file browser interface.

Computers operate through level upon level of models, starting (unless you want to get metaphysical) with voltages in electrical circuits, progressing to boolean logic, machine language, and so on, and perhaps via one or several virtual machines, up to, finally, the model that the user of the program understands.

Intuitive Models

Models were not invented for computing. Everybody thinks intuitively in levels of abstraction. Generally, people have no problem accepting that if you look under the surface of the simplified concepts we accept as objects or concepts – things of various kinds – you find a more fine-grained physical level of existence, which is quite different from what we usually perceive.

But strangely enough, some people, computing professionals in fact, sometimes fail to understand that the model can be independent of the lower physical level. I have suffered from this myself, and perhaps I still do in some areas, but I have gotten a lot better over the last few years.

Programming Languages

When I have told people that I develop high-performance systems in Java, I have often (less now than a few years ago) had the surprised reaction “but Java is so slow, how can you get it fast enough?”

Java is a language which, like other programming languages, expresses sequences of logical (in a wider sense) operations. How could a language possibly be slow, or fast for that matter? Consider the following Java code that calculates a number in the Fibonacci sequence:
    long s = 0, t = 1;
    for (int i = 0; i < n; ++i) {
        long u = s + t;
        s = t;
        t = u;
    }

Now, take a look at the equivalent code in C++:
    long s = 0, t = 1;
    for (int i = 0; i < n; ++i) {
        long u = s + t;
        s = t;
        t = u;
    }

Which one is faster? How can any of them be faster? They are exactly the same! (Java and C++ are closely related languages with the same basic syntactic elements.)

Now, perhaps you think I am just being difficult. Obviously, what people mean when they say that Java is slower than C++ is not that the language per se is slow. It is that code compiled with a Java compiler and executed in a Java runtime environment is slower than the same code compiled by a C++ compiler and run by the operating system directly. This is because the Java runtime environment is a virtual machine – an interpreted system. It is a program that looks at one instruction at a time and changes its state accordingly, just like the physical machine that the operating system is running on. This extra level adds overhead.

Is this true, then?

It certainly was true – about ten years ago. But then just-in-time compilation came into use. Virtual machines with this feature did not execute instructions directly, but compiled them to machine code the first time they saw them, and fed the machine code to the physical machine. This could be almost as fast as compiling ahead of time, but it still had the overhead of compiling at runtime.

Finally, the HotSpot virtual machine arrived. It did not just compile everything it saw. It gathered statistics about which parts of the code were run most often, and made a greater effort to produce efficient code for the more frequent parts.

About five years ago, when I got into a discussion with someone who claimed that Java was inevitably slower than C++ for tight loops, I did a comparison between Java and C++ execution, on tight-loop code similar to the Fibonacci calculation code above. The Java code executed in the HotSpot virtual machine was significantly faster than the C++ code compiled by GCC with full optimization and run directly on the physical machine.
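I no longer have that exact benchmark, but the shape of such a comparison is easy to sketch (class and method names are mine; a serious measurement would use a harness such as JMH to handle warm-up and dead-code elimination properly):

```java
public class FibTiming {
    // The tight loop from the text: returns the n-th Fibonacci number.
    static long fib(int n) {
        long s = 0, t = 1;
        for (int i = 0; i < n; ++i) {
            long u = s + t;
            s = t;
            t = u;
        }
        return s;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Warm-up: give HotSpot a chance to profile and compile the loop.
        for (int i = 0; i < 100_000; ++i) sink += fib(90);
        // Timed run, after the JIT has (presumably) kicked in.
        long start = System.nanoTime();
        for (int i = 0; i < 100_000; ++i) sink += fib(90);
        long elapsed = System.nanoTime() - start;
        // Printing the checksum keeps the loop from being optimized away.
        System.out.println(elapsed + " ns for 100000 calls (checksum " + sink + ")");
    }
}
```

The equivalent C++ program, compiled with full optimization, provides the other timing; the surprise five years ago was which side won.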

The explanation is that the Java environment, doing the compiling in runtime, has more information at its disposal than the C++ compiler. Hence, it is able to generate more efficient machine code.

This was (and still is, I imagine) astonishing to a lot of people, stuck with the interpretation overhead stigma.

The point of all this is that the language – the model, so to speak – does not determine whether executing code is efficient or not. The language is just a way of expressing transitions from one machine state to the next. What ultimately should matter for efficiency is the computational complexity of computing the next state. There is, in theory, absolutely no reason why code expressed in one model should be less efficient than code expressed in another. If the system is smart enough, it can execute it in the most efficient way anyway. This is the true meaning of the word optimization.

What is called optimization in practice is not really optimization, because it rarely finds the optimal code. It just finds code that is, hopefully, better than a naive translation from high level language to machine code. I suppose true optimization in general is impossible. (I have not found an actual proof, but it feels like it would be related to computing Kolmogorov complexity.) However, in every case where the system can recognize a piece of code that expresses something it has an efficient execution method for, it can use that method.

The evolution of Java systems is an excellent example of how an improved implementation can increase performance without changing the model.

Next Time: Data Models

The main point that I want to make about misunderstood models has, as you might expect (and as promised at the end of the previous post), to do with database systems. But this post is already quite long, and I do not want to cram in the most important subject at the end. You may already have guessed the essence of it, but for the details I again have to refer you to a future post.

Monday, October 20, 2008

Text Search and the Relational Model

by Jesper Larsson

About seven years ago, I gave a presentation to a large group of potential customers about the text search product that was our big thing at the time. I explained that our storage was more compact than in the relational model. I was so wrong.

Now I know better. Unfortunately, most of the industry and much of academia does not. People get away with saying the things that I did without being laughed at, just as easily as ever.

Not understanding what the relational model really is, people take it to be what the major relational database systems do. Hence, when people cannot get what they need from those systems, they conclude that the relational model is not suitable for their needs. So, they throw away one of the most powerful abstractions for data processing without even looking at it.

It is particularly common to write off the relational model in the text search field. You can frequently hear people say things like “the relational model is not suitable for capturing the structure of text”, or refer to text by the peculiar term “unstructured data”. Some writers talk about combining text and relational data, as if there were a contradiction. As if relational data were a special kind of data. A more correct account, like that of Curt Monash, is to simply note that text does not work very well with the architecture of mainstream relational products – which is true. Text search applications based on major database systems usually turn out painfully inefficient. This is one of many points where the architecture of those systems does not match the demands of the Internet world. Database system vendors have been unable to adapt, which has created a separate market for search engines.

But that avoids the core issue. Since text search is one of my top areas of expertise, I hope I can explain to you why the relational model is perfectly capable of capturing the structure of text. I'll start at the very bottom, explaining what text search really is.

Searching for Words

Boiled down to the most fundamental formulation, the text search problem is to answer the query “which documents match the text I give you?” The result is a set of document identifiers of some kind – titles, file names, internet addresses or something, depending on the application. Let us assume, for simplicity, that each document has a unique number that you use for identifying it.

Then you might have, for example, a database such as this one:

Text Doc. no
war and peace 1
art of war 2
modern art 3

The text data here may not be something that you would normally consider a document, but that is just because I want to keep the example small. The same principles apply if the texts are hundreds, thousands, or millions of words long, and if there are millions of documents.

Say that you want to search for a single word, war for instance. Scanning through all the text of all the documents would probably not be efficient enough for a real-life database, so you need some sort of data structure to make searching a quicker task.

A method that you are sure to be familiar with, even if you know nothing about computing, is to line up items alphabetically to allow searching for something without looking through all the data. A computer program can use binary search to locate something quickly in an ordered list, or it can build a search tree that allows the same sequences of comparisons without having to physically align the sorted items. (One out of a bunch of standard search methods at the programmer's disposal.)
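In Java, for instance, the ordered-list tool comes ready-made: `Arrays.binarySearch` over a sorted array (the example data is mine, taken from the table above):

```java
import java.util.Arrays;

public class OrderedLookup {
    public static void main(String[] args) {
        // Items lined up alphabetically; binary search halves the
        // candidate range at each comparison.
        String[] sorted = {"art of war", "modern art", "war and peace"};
        System.out.println(Arrays.binarySearch(sorted, "war and peace")); // 2
        // A negative result encodes the insertion point for a miss.
        System.out.println(Arrays.binarySearch(sorted, "peace") < 0); // true
    }
}
```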

When you create an index for a data field (a table column) in a common relational database system, this is essentially what happens. Most commonly, some sort of B-tree is created, and subsequent requests for records based on that field can use the tree to speed up finding the right records.

So, an index for a text field can be visualized as another table, where the rows are ordered alphabetically:

Text Doc. no
art of war 2
modern art 3
war and peace 1

This lets us find document no 1 quickly by looking for war in the ordered text field. But what about document 2? It contains war as well, but the index is no help since the word is not at the beginning.
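A quick sketch of that limitation, using a `TreeMap` (the red–black search tree in Java's standard library, standing in for the B-tree) keyed on the whole text field:

```java
import java.util.*;

public class PrefixOnly {
    // Documents whose text field *starts* with the given word, found by
    // a range scan over an ordered index keyed on the whole text.
    static Collection<Integer> prefixHits(TreeMap<String, Integer> index, String word) {
        // Appending "\uffff" gives a crude upper bound for the prefix range.
        return index.subMap(word, word + "\uffff").values();
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> index = new TreeMap<>();
        index.put("art of war", 2);
        index.put("modern art", 3);
        index.put("war and peace", 1);
        // Finds document 1 only; document 2 contains "war" too, but not
        // at the beginning, so the ordered index misses it.
        System.out.println(prefixHits(index, "war")); // [1]
    }
}
```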

This is where common database systems give up and tell you to stop using the relational model. Instead, they supply you with a special kind of text index, which can be invoked either with a special operator to specify that a record contains a word, or through a hierarchical programming interface.

You may find it a bit surprising that they give up so easily, because it should be obvious to almost anyone, including major database vendors, that the standard text search practice of inverting the data can be applied like this:

Word Doc. no Position
and 1 2
art 2 1
art 3 2
modern 3 1
of 2 2
peace 1 3
war 1 1
war 2 3

This table tells us, for each word, in what position in what document the word can be found. In fact, it conveys exactly the same information as the original table. The difference is that with this organization, it is simple to express queries such as “which documents contain the word war” using just a normal relational language. No special operators, no non-relational interface.
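Deriving the inverted triples from the original table takes only a few lines; here is a sketch in Java, under the assumption (discussed further down) that a text field consists of space-delimited words:

```java
import java.util.*;

public class InvertedIndex {
    // Build "word doc position" triples from (doc number -> text) rows,
    // assuming each text field consists of space-delimited words.
    // (String sorting is fine for this sketch's single-digit numbers.)
    static List<String> invert(Map<Integer, String> docs) {
        List<String> triples = new ArrayList<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            String[] words = e.getValue().split(" ");
            for (int pos = 0; pos < words.length; ++pos) {
                triples.add(words[pos] + " " + e.getKey() + " " + (pos + 1));
            }
        }
        Collections.sort(triples);
        return triples;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "war and peace");
        docs.put(2, "art of war");
        docs.put(3, "modern art");
        // Prints the same rows as the table above, in the same order.
        invert(docs).forEach(System.out::println);
    }
}
```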

In addition, unlike a specialized text index, it allows you to express a bunch of other queries about the words and the documents. This is one of the strengths of the relational model: it uses a small, fixed set of operators, but still lets you formulate practically any query that you can think of.

Say that you want to find which documents contain both war and art. In a relational language, this is easily expressed as the intersection between the documents containing war and those containing art. (There is actually also a more intriguing way to express it that I will get to in a future post.) By contrast, if your word query construct is a special contains operator, you need to add support for a more complex argument than just a word. For instance, you could say something like “contains(war and art).” The argument string, “war and art”, is something that the implementation of the contains operator has to parse and make its own kind of sense of. In effect, this means that you have to have a whole separate query language for the contains argument alone!
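Here is what that intersection amounts to, sketched in Java over a word-to-documents map (hand-built here from the example data):

```java
import java.util.*;

public class Intersection {
    // Documents containing both words: plain set intersection.
    static Set<Integer> both(Map<String, Set<Integer>> index, String a, String b) {
        Set<Integer> result = new TreeSet<>(index.get(a));
        result.retainAll(index.get(b));
        return result;
    }

    public static void main(String[] args) {
        // word -> documents containing it, from the example data.
        Map<String, Set<Integer>> index = Map.of(
                "and", Set.of(1), "art", Set.of(2, 3),
                "modern", Set.of(3), "of", Set.of(2),
                "peace", Set.of(1), "war", Set.of(1, 2));
        System.out.println(both(index, "war", "art")); // [2]
    }
}
```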

This should really be enough to convince anyone that the relational model is suitable for capturing the structure of text.

To be fair, we should note that there is no way to obtain the word-oriented organization from the document-oriented one using standard relational syntax. This is because it depends on a particular interpretation of the text field in the original table – that it consists of space-delimited words. So you either have to present the system with the word-oriented data as input, or include some way of specifying how the text field is to be broken down into multiple word values.

Model vs Implementation

Indexing by inverting text data is the basis of pretty much all information retrieval. It is how the text indexes of the database products work below the surface, and it is the underlying method of most search engines you find on the Internet – large and small. (Although there is a rumor, supported by a peculiar change of behavior, that Google recently changed their basic search method to something completely different.)

So why don't the major relational database systems expose their text indexes in a relational way? The answer lies in the core problem shared by all those systems, as far back as System R: the lack of separation between that which is logical and that which is physical, between model and implementation.

That is a subject that deserves to be treated as something more than a side note. It will be the subject of an upcoming post.

Friday, October 3, 2008

Why the Database Masters Fail Us

by Jesper Larsson

There is hardly any field in computing more plagued by religious wars than database systems. For nearly 40 years, the battle has raged between various architectures, occasionally with some combatants replaced – or at least renamed. Still, it seems that we are further away than ever from a sort of database that we can be satisfied with. (I am not going into the details of the problems right now; that will be a subject of future posts. But if you are in the business I am sure you are familiar with some of them.)

Let us take a couple of steps back and take a look at the situation. Let us forget, for the moment, our personal stance in the fight on data models, platforms etc., and ask ourselves: who do we depend on to design database systems? What is driving them? Could they do better? Could we do better, with a different kind of effort?

I have come up with three groups of operators that influence the design of database platforms. I call them the paper writers, the evangelists, and the merchants. Anyone who creates the actual code that makes up the database systems is a servant of one or more of these masters. Let us go over them one by one.

The Paper Writers

What would be a better source of knowledge to base our system design on than science? According to popular view, scientists, or more specifically academic researchers, have the task of advancing human knowledge. Using objective observation and rational reasoning, they find the ultimate truth, unaffected by fads or short-sighted economics.

Unfortunately, it does not quite work that way.

The primary concern of most academic researchers is to produce papers – articles published in conference proceedings or scientific journals. Published papers are what they are judged by, the most important merit for their Ph.D. degrees and research grants.

Consequently, academic researchers learn to become experts on getting papers published. They have an ambition to advance human knowledge too, but that is a far-fetched goal that only established stars can afford to have as their first priority. In daily work, the immediate focus of most researchers is to impress peer reviewers – people in their own field who decide whether papers get accepted or not.

For several reasons, this makes scientific work less useful for practitioners than you might expect.

First, it influences which subjects get explored. Researchers tend to pick subjects that are currently in fashion, for which papers are in demand. The result is that current trends have a large impact on what subjects people choose to work with.

Second, it has an effect on the language of published research. Since writers and reviewers are actually the same people – the researchers involved with the field – writers use language meant to be understood and judged as appropriate by other people like themselves. This has a self-amplifying effect, with the result that publications often seem impenetrable or irrelevant to people outside the field.

Third, once the papers are published, the work is finished as far as the researcher is concerned. Few researchers bother to take their findings any further.

The consequence is that academic research rarely gives us comprehensible knowledge of how to design a system. What we mostly get are thousands of fragments of potentially useful knowledge, clumped together in bursts around subjects that are popular over a few years, and usually presented in a language that is difficult to penetrate.

Many research projects include implementation of actual software systems to test or demonstrate research findings. Sometimes they develop into industrially useful ones, but only rarely. Research systems are typically not fully functional or efficient enough for general practical use, and it would hardly be fair to expect them to be, especially for software as large and complex as modern database systems are. After all, the point of academic research is not to produce ready-made systems.

The Evangelists

There are a number of people out there who claim that they have the correct view of how a database should be constructed, and that essentially everyone else is wrong. If people would just listen to them, everything would turn out fine. Of course, oddballs, fanatics, and charlatans exist in every field, but none of these labels quite captures the database evangelists – at least not all of them.

In particular there is a group of people who persistently promote the relational model. But, the relational model is already dominant in the database world, isn't it? Is not everything fine, then? No, this true relational model lobby, with esteemed relational database pioneer C.J. Date as their figurehead, claims that the version of relational databases that dominates the industry is distorted; that the relational model is misunderstood or mistreated by practically everyone, even to some extent its inventor E.F. Codd!

I may have made this sound a little more eccentric than it deserves. The fact is, I agree in principle with most of what Date and his allies say. However, even if they are correct about the data model, they do not have all the solutions needed to create a full database system. In their fervor to promote the true relational model, there are a number of problems that they de-emphasize or do not address at all. Hardware utilization and efficiency, for instance, they write off as somebody else's problem.

On the other hand, an extreme pragmatic lobby has recently emerged, centered around Michael Stonebraker, another veteran of the database field. Although Stonebraker's thesis that it's time for a complete rewrite of database products could plausibly be supported by C.J. Date, Stonebraker could not care less about purifying the relational model. He currently endorses abandoning the very idea of general-purpose database systems for specialized, application-specific solutions.

The database evangelists have an influence through their writings as well as through their contacts with implementation projects in academia and in the industry. Most of the time, their direct impact is minor, but they may have important roles as architects of future systems.

The Merchants

The vendors that make their living producing database systems obviously want to sell their products or services to as many customers as possible. This makes them keenly monitor what the market seems to want, and declare that this is just what they have – sometimes adjusting their products accordingly.

On the one hand, merchants tend to be conservative, at least on the main issues. They are frightened by radical new ideas that threaten to be costly both to them and to their customers. Database management platforms are heavy components in most IT infrastructures, coupled with large investments and legacy issues. Rather than improving their systems at the core, merchants prefer to add peripheral components, covering more and more of customers' software needs.

On the other hand, the merchants must be extremely sensitive to trends. They have to keep up with the latest buzzwords, not to appear to be falling behind the competition. Also, they are always on the lookout for new features – minor extensions that they can use as selling points.

The result, when merchants get their way, is that once their product has established a decent market share, it more or less stops developing, and starts growing instead.

Who Will Create the Perfect Database System?

As you can see, if you roughly agree with my outline of the operators, nobody really has both the will and the capability to create a good database management system. How is this different from other areas of the software industry? Simply because a database platform is such a monumental piece of software, not only to produce, but also to use. Changing your database system can be more demanding than replacing your operating system. To both create and sell a major product is a gargantuan task.

It is unlikely for any company or academic institute to successfully take on the task of designing and producing from scratch, and then selling to the market, a database system that is different enough to take a major leap forward. It can happen – it has happened before – but the way the industry has developed, it has become a lot more difficult since the last major shift with the relational database breakthrough in the early 1980s.

More likely, new systems will evolve slowly. People will abandon the monolithic one-size-fits-all database systems, as Stonebraker predicts, and create specialized lightweight solutions for various applications. This has already happened in some areas.

However, I am convinced that there is a place for general data management platforms in the future. It is simply too much work for everyone to roll their own.

So What am I Selling?

Who am I to say all this then, and what is my interest in the business? I am head of research at a medium-sized Scandinavian company named Apptus Technologies, which started out offering services around existing major database systems. We suffered from the sluggishness of the platforms, adapted to it, and ultimately made our business from it.

For a number of years we have produced search systems for large and complex data sets, to be accessed over the internet by millions of users. We achieved efficiency and flexibility by gradually moving away from major database platforms. Ultimately, we created our own full-fledged database management system. We have never sold it as a standalone system (nor do we intend to in the foreseeable future), only used it as a base for more specialized systems. Therefore, we have enjoyed a freedom in choosing how to develop our platform that database system vendors do not have – including the freedom to change our minds.

We have learned a lot during the last eight years, gained a lot of insights, and developed some strong opinions in the process. Now, we have decided to stick our heads out of the laboratory a bit, and try to start a conversation with the world outside. Not just about the advantages that our technology can produce for our customers (we have a sales department for that), but about the technology and ideas behind it, as well as our visions for the future of information systems. It is going to be a lot about databases, but also about programming and computer science in general.

I hope you will join us. Stay tuned to this blog. Subjects coming up: why you can benefit from rewriting your system rather than patching it; how we started out misunderstanding relational databases, and why most people still do.