Warning - the following is NOT YET implemented as described. I merely wanted to present my current thinking and plans on the issue of NULLs. -jcw

Data is rectangular. The world isn't.

Vlerq's data structures are based on the relational model. Facts are rows, collections of facts are relations ("views" in Vlerq). The following example shows that this approach leads to a rectangular world:

There are three facts about people, which we tag with their name (the key in this example), as well as details about their age and their shoe size.

But what if we didn't know Kim's shoe size? In a pure relational approach, you then have to split off columns to the point where only known facts get stored:

You can then join these relations on "name" when you need information about both age and shoe size. Unfortunately, that means you can't manipulate all data at once this way. In a join, Kim's information will be omitted because we can not represent a combined fact about her specifying both her age and her shoe size. In theory, the above split is very clean. In practice, it gets really tedious.

In SQL we can represent the fact that we don't know Kim's shoe size by using "NULL" as a special placeholder:

So far so good. Ok, let's find everyone with shoe size over 35:

Now everyone else, i.e. those with shoe size not over 35:

Oops, where's Kim?

NULL introduces the concept of three-valued logic. Some facts satisfy a condition, others do not - and then there are those for which we can't decide.

For a taste of what NULL leads to in SQL, see http://dev.mysql.com/doc/refman/5.0/en/working-with-null.html . This example does something totally horrid by the way: it places the words "NULL" and "value" side by side.

The phrase "NULL value" is a contradictio-in-terminis: NULL is NOT a value!

After all, that's the whole point of NULL: to represent the absence of a value. The widespread usage of the phrase "NULL value" illustrates how absence of a value gets confused with presence of a special marker. Ouch, ouch, ouch.

Vlerq will need to deal with missing values. For two reasons: to be usable as foundation for SQL-like layers on top, and to be able to represent XML well.

Conceptually, the way to do this is quite simple: for each item, store an extra flag which indicates whether the value is present or missing. If valid, then the item contains a value - if not, ignore any value stored for that item.

In practice, Vlerq manages missing values in a considerably more sophisticated way, with a per-item overhead of at most 1.5 bits (and often considerably less) and consuming no storage space at all for missing values.

The public interface for missing values does not use a special "NULL" marker. Instead, there is an operator to "unset" a value. This mimics the way Tcl works: use "set" to store a value, and "unset" to remove it.

When extracting values, there is a choice: either you can extract tagged values, or you can fill in a default value for those missing in the underlying data. With the above example, extracting Kim's information as tagged values would return:

Note that there simply is no tag for the shoe size, nor any value. Whereas extracting Kim's info with "0" as default for missing shoe sizes would return:

Both mechanisms have their use. Keep in mind that this only matters when extracting data out of Vlerq - internally, missing data is properly managed.

Missing item values are not the only case of missing data in Vlerq. The other scenario where tracking missing information can be useful is at the row level.

As noted before, views are collections of rows. They are in fact ordered collections, i.e. rows have a position in the view. In programming languages such as C and Java, such collections would be called "arrays". In scripting languages such as Python, Ruby, and Tcl the term used is "lists". Other languages still call them "vectors".

The ordinal positioning means that you can take a view and insert or delete rows at a specific position in the view. The effect is that all following rows move up or down. And while rows in a view do have a position, that position can change due to insertions and deletions. This differs from rows in relational algebra relations, and in SQL relations (which does support ordering, but only in an ORDER BY clause).

Positional access to rows in a view is very convenient for day-to-day use. It makes views as convenient as in-memory arrays/lists. Row positions can often be used instead of artificially generated unique ID's, not just for acces to each row but also for referencing rows from other views.

The trouble with positions however is the fact that they can change. You either have to avoid insertions + deletions and use some sort of row-deleted marker, or you have to adjust all references to a row whenever its position is affected.

This is where missing rows can help: in Vlerq, you can keep row positions intact by only appending rows at the end of the view, and by using the "remove" operator instead of "delete" to mark entire rows as no longer existing. The difference is that "remove" does not affect the position of other rows. After being removed, all items in a row will have missing values. To be precise: there is actually a small distinction between a row of which all items have been unset and a row which has been removed, but in actual use that difference can often be ignored.

So what missing rows add, is the ability to rely on fixed row positions - an automatic and very natural id for the row, by which it can be accessed quickly. As with missing item values, missing rows consume virtually no storage space.

-jcw, 2007-04-13



2007-06-14

Created

2007-06-14

(Changed: stat)


TN01 - Technical Notes