Monday, October 20, 2008

Text Search and the Relational Model

by Jesper Larsson

About seven years ago, I gave a presentation to a large group of potential customers about the text search product that was our big thing at the time. I explained that our storage was more compact than in the relational model. I was so wrong.

Now I know better. Unfortunately, most of the industry and much of academia does not. People get away with saying the things that I did without being laughed at, just as easily as ever.

Not understanding what the relational model really is, people take it to be what the major relational database systems do. Hence, when people cannot get what they need from those systems, they conclude that the relational model is not suitable for their needs. So, they throw away one of the most powerful abstractions for data processing without even looking at it.

It is particularly common to write off the relational model in the text search field. You can frequently hear people say things like “the relational model is not suitable for capturing the structure of text”, or hear them refer to text by the peculiar term “unstructured data”. Some writers talk about combining text and relational data, as if there were a contradiction, as if relational data were a special kind of data. A more correct account, like that of Curt Monash, simply notes that text does not work very well with the architecture of mainstream relational products – which is true. Text search applications based on major database systems usually turn out painfully inefficient. That is one of many points where the architecture of those systems does not match the demands of the Internet world. Database system vendors have been unable to adapt, which has created a separate market for search engines.

But that avoids the core issue. Since text search is one of my top areas of expertise, I hope I can explain to you why the relational model is perfectly capable of capturing the structure of text. I'll start at the very bottom, explaining what text search really is.

Searching for Words

Boiled down to the most fundamental formulation, the text search problem is to answer the query “which documents match the text I give you?” The result is a set of document identifiers of some kind – titles, file names, internet addresses or something, depending on the application. Let us assume, for simplicity, that each document has a unique number that you use for identifying it.

Then you might have, for example, a database such as this one:

Text            Doc. no
war and peace   1
art of war      2
modern art      3

The text data here may not be something that you would normally consider a document, but that is just because I want to keep the example small. The same principles apply if the texts are hundreds, thousands, or millions of words long, and if there are millions of documents.

Say that you want to search for a single word, war for instance. Scanning through all the text of all the documents would probably not be efficient enough for a real-life database, so you need some sort of data structure to make searching a quicker task.

A method that you are sure to be familiar with, even if you know nothing about computing, is to line up items alphabetically to allow searching for something without looking through all the data. A computer program can use binary search to locate something quickly in an ordered list, or it can build a search tree that allows the same sequences of comparisons without having to physically align the sorted items. (One out of a bunch of standard search methods at the programmer's disposal.)
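To make that concrete, here is a minimal sketch in Python: the toy documents go into an alphabetically sorted list, and the standard bisect module does the binary search. (The code only illustrates the ordered-lookup idea, not any particular product.)

import bisect

# The toy documents from the table above, kept in alphabetical order.
texts = sorted(["war and peace", "art of war", "modern art"])

def find_exact(query):
    """Return True if query matches a whole entry in the ordered list."""
    i = bisect.bisect_left(texts, query)
    return i < len(texts) and texts[i] == query

print(find_exact("modern art"))  # True
print(find_exact("war"))         # False: only whole entries match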

When you create an index for a data field (a table column) in a common relational database system, this is essentially what happens. Most commonly, some sort of B-tree is created, and subsequent requests for records based on that field can use the tree to speed up finding the right records.
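In concrete terms, that step might look something like this sketch, which uses Python's built-in sqlite3 module; the table and column names (documents, doctext, docno) are just invented for the example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE documents (doctext TEXT, docno INTEGER)")
con.executemany("INSERT INTO documents VALUES (?, ?)",
                [("war and peace", 1), ("art of war", 2), ("modern art", 3)])

# A B-tree index over the text column.
con.execute("CREATE INDEX documents_doctext_idx ON documents (doctext)")

# A lookup keyed on the indexed field no longer needs to scan every row.
print(con.execute(
    "SELECT docno FROM documents WHERE doctext = 'war and peace'").fetchall())
# prints [(1,)]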

So, an index for a text field can be visualized as another table, where the rows are ordered alphabetically:

Text            Doc. no
art of war      2
modern art      3
war and peace   1

This lets us find document no 1 quickly by looking for war in the ordered text field. But what about document 2? It contains war as well, but the index is no help since the word is not at the beginning.

This is where common database systems give up and tell you to stop using the relational model. Instead, they supply you with a special kind of text index that can be invoked either through a special operator for specifying that a record contains a word, or through a hierarchical programming interface.

You may find it a bit surprising that they give up so easily, because it should be obvious to almost anyone, including major database vendors, that the standard text search practice of inverting the data can be applied like this:

Word      Doc. no   Position
and       1         2
art       2         1
art       3         2
modern    3         1
of        2         2
peace     1         3
war       1         1
war       2         3

This table tells us, for each word, in what position in what document the word can be found. In fact, it conveys exactly the same information as the original table. The difference is that with this organization, it is simple to express queries such as “which documents contain the word war” using just a normal relational language. No special operators, no non-relational interface.
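For example, the single-word query comes out as an ordinary restriction and projection. Here is a sketch, again using Python's sqlite3 module with invented names: a words table with word, docno and pos columns holding the rows above.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (word TEXT, docno INTEGER, pos INTEGER)")
con.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    ("and", 1, 2), ("art", 2, 1), ("art", 3, 2), ("modern", 3, 1),
    ("of", 2, 2), ("peace", 1, 3), ("war", 1, 1), ("war", 2, 3),
])

# "Which documents contain the word war?"
print(con.execute(
    "SELECT DISTINCT docno FROM words WHERE word = 'war'").fetchall())
# prints [(1,), (2,)]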

In addition, unlike a specialized text index, it allows you to express a bunch of other queries about the words and the documents. This is one of the strengths of the relational model: it uses a small, fixed set of operators, but still lets you formulate practically any query that you can think of.

Say that you want to find which documents contain both war and art. In a relational language, this is easily expressed as the intersection between the documents containing war and those containing art. (There is actually also a more intriguing way to express it that I will get to in a future post.) By contrast, if your word query construct is a special contains operator, you need to add support for a more complex argument than just a word. For instance, you could say something like “contains(war and art).” The argument string, “war and art”, is something that the implementation of the contains operator has to parse and make its own kind of sense of. In effect, this means that you have to have a whole separate query language for the contains argument alone!
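Sketched with Python's sqlite3 module against the same invented words table as in the previous sketch, the intersection can be written directly:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (word TEXT, docno INTEGER, pos INTEGER)")
con.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    ("and", 1, 2), ("art", 2, 1), ("art", 3, 2), ("modern", 3, 1),
    ("of", 2, 2), ("peace", 1, 3), ("war", 1, 1), ("war", 2, 3),
])

# Documents containing both war and art: two restrictions and an intersection.
print(con.execute("""
    SELECT docno FROM words WHERE word = 'war'
    INTERSECT
    SELECT docno FROM words WHERE word = 'art'
""").fetchall())
# prints [(2,)]

The two restrictions and the intersection are all ordinary relational operations; there is no argument string that anything has to parse.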

This should really be enough to convince anyone that the relational model is suitable for capturing the structure of text.

To be fair, we should note that there is no way to obtain the word-oriented organization from the document-oriented one using standard relational syntax. This is because it depends on a particular interpretation of the text field in the original table – that it consists of space-delimited words. So you either have to present the system with the word-oriented data as input, or include some way of specifying how the text field is to be broken down into multiple word values.
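Such a breakdown is easy enough to sketch, though. Here is one possible inversion step in Python, assuming space-delimited words and positions counted from 1, as in the tables above:

# The document-oriented rows from the first table.
documents = [("war and peace", 1), ("art of war", 2), ("modern art", 3)]

def invert(docs):
    """Yield one (word, docno, position) row per word occurrence,
    splitting on whitespace and counting positions from 1."""
    for doctext, docno in docs:
        for pos, word in enumerate(doctext.split(), start=1):
            yield (word, docno, pos)

for row in sorted(invert(documents)):
    print(row)
# prints the same rows as the inverted word table above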

Model vs Implementation

Indexing by inverting text data is the basis of pretty much all information retrieval. It is how the text indexes of the database products work below the surface, and it is the underlying method for most search engines you find on the Internet – large and small. (Although there is a rumor, supported by a peculiar change of behavior, that Google recently changed their basic search method to something completely different.)

So why don't the major relational database systems expose their text indexes in a relational way? The answer lies in the core problem shared by all those systems, as far back as System R: the lack of separation between that which is logical and that which is physical, between model and implementation.

That is a subject that deserves to be treated as something more than a side note. It will be the subject of an upcoming post.

2 comments:

Anonymous said...

You really got me thinking with this post. Waiting for the followup!
Keep up the work :)

Jesper Larsson said...

Thanks, oerd. The followup that concerns the cliffhanger at the end of this post is already posted. I took this opportunity to add a hyperlink to it.