A second experiment for loading XML into a MetaKit datafile
===========================================================
November, 2001
Coen Siegerink
The first experiment is still at https://www.equi4.com/metakit/xml/.
This version explores a different data format. While the above had a
generic static structure to store elements, attributes, and children,
this design does away with that and creates one view per element type.
The "mk4xml" utility is a C++ program (adapted from the first trial),
which uses James Clark's "Expat" to parse XML and load things into a
MK datafile. It depends pretty heavily on MK's ability to restructure
on the fly, as new elements with different attribute sets are scanned.
In fact, it does something which is quite tricky: it is able to load
all data and *then* fix up the data structure before saving it to file.
For example, take this input:
<a b="c"><d>wow</d><e>whoop<d f="g">dee</d>doo</e></a>
Mk4xml turns this into a MK data structure of the form:
a[_V:S,b:S],d[_V:S,f:S],e[_S[t:S],_V:S]
The result will be:
a[0]: _V = "", b = "c"
d[0]: _V = "wow", f = ""
d[1]: _V = "dee", f = "g"
e[0]: _S[0] = "doo", _V = "whoop"
Comments:
* the "a" view stores the "a" elements, with their "b" attribute
* the "d" view has two rows, missing attributes are empty strings
* the "e" view has a subview to store extra data items (here "doo")
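This loading pass can be sketched in Python (a hypothetical model, not
the actual mk4xml source), using xml.parsers.expat, the standard binding
to the same Expat parser. The _V/_S split rule is inferred from the
example: _V holds the text before the first child, _S the text after
each child.

```python
# Hypothetical model of mk4xml's loading pass, inferred from the example.
import xml.parsers.expat

views = {}   # element type -> list of rows, i.e. one "view" per type
stack = []   # rows of the currently open elements

def start(name, attrs):
    row = {"_V": "", "_S": []}
    row.update(attrs)
    views.setdefault(name, []).append(row)
    stack.append(row)

def chars(data):
    row = stack[-1]
    if row["_S"]:
        row["_S"][-1] += data    # text after a child goes into subview _S
    else:
        row["_V"] += data        # text before any child goes into _V

def end(name):
    stack.pop()
    if stack:
        stack[-1]["_S"].append("")   # open a new text slot after this child

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start
p.CharacterDataHandler = chars
p.EndElementHandler = end
p.Parse('<a b="c"><d>wow</d><e>whoop<d f="g">dee</d>doo</e></a>', True)

for rows in views.values():          # drop empty text slots; the example
    for row in rows:                 # shows none being stored for them
        row["_S"] = [s for s in row["_S"] if s]
```

One simplification: attributes missing from a row simply stay absent
here, whereas MK's restructuring gives every row of a view an (empty)
value for each attribute seen in that element type.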
There's one key ingredient missing in the above: structure. This is
represented by a separate view called "_R", which is of the form:
_R[pc:I,pr:I,c:I,r:I]
The meaning of these fields is:
* pc is the parent column, i.e. the N'th property in the storage
* pr is the parent row, i.e. which row is the parent element
* c is the column of the child entry
* r is the row of the child entry
Entries are in order, so a selection on a specific (pc,pr) pair will
return the set of children of element (pc,pr), in the order they are present.
View _R holds one entry for each non-root element, in the above example:
_R[0]: pc = 1, pr = 0, c = 2, r = 0
_R[1]: pc = 1, pr = 0, c = 3, r = 0
_R[2]: pc = 3, pr = 0, c = 2, r = 1
Since view _R is always first, the above encoding works out as:
* view "a" is column 1
* view "d" is column 2
* view "e" is column 3
As you can see, "a" has 2 children, while "d" has none. Similarly, "d"
appears as a child in two parent elements, while "e" appears in one.
The number of entries of view _R will always be the total number of
elements minus one (since the root is omitted). Also, the root is by
definition always in column 1, row 0 (since that is the first real view).
Some properties of the above design are:
* efficient storage for trees with both low and high fanout
(MK datafiles will usually end up noticeably more compact than XML)
* instant access to specific nodes (identified by a (column,row) pair)
* binary search access speed to locate all children of an element
* linear (but fast/integer) scan to locate the parent of an element
(this might be improved by adding a hash view, perhaps on open)
* it is of course trivial to traverse all elements of a single type
* attribute names are stored once for each element type
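To illustrate the two access-speed claims, here is a small Python
sketch over the example's _R entries: children are located with a
binary search on the sorted (pc, pr) keys, parents with a linear
integer scan.

```python
# Sketch of both lookups over the example's _R entries; _R is kept
# sorted on (pc, pr), which is what makes the binary search valid.
from bisect import bisect_left, bisect_right

_R = [(1, 0, 2, 0), (1, 0, 3, 0), (3, 0, 2, 1)]
keys = [(pc, pr) for pc, pr, c, r in _R]

def children(pc, pr):
    """Binary search: all (c, r) children of (pc, pr), in document order."""
    lo, hi = bisect_left(keys, (pc, pr)), bisect_right(keys, (pc, pr))
    return [(c, r) for _, _, c, r in _R[lo:hi]]

def parent(c, r):
    """Linear integer scan: the (pc, pr) parent of (c, r), if any."""
    for pc, pr, cc, rr in _R:
        if (cc, rr) == (c, r):
            return (pc, pr)
    return None   # the root has no _R entry, hence no parent

print(children(1, 0))   # [(2, 0), (3, 0)]: both children of the root "a"
print(parent(2, 1))     # (3, 0): the second "d" sits inside "e"
```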
Does this make a truly efficient storage format? I don't know yet. But
one of the things MK makes so easy is restructuring on the fly. For
some specific data structures, it will be far more efficient to design
a custom format; the key point is that this can be done after loading.
Another issue which has not been looked into further is how well suited
this format is for modifications. On the other hand, generating plain
XML from this data format ought to be very simple.
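To back up that last claim, here is a sketch which regenerates the XML
of the example from its tables and _R entries. The data is transcribed
by hand from this note; the interleaving rule (child i is followed by
text _S[i]) and the skipping of empty attributes are assumptions, not
confirmed mk4xml behavior.

```python
# Regenerate XML from the example's per-type views plus _R.
views = {
    1: ("a", [{"_V": "", "b": "c", "_S": []}]),
    2: ("d", [{"_V": "wow", "f": "", "_S": []},
              {"_V": "dee", "f": "g", "_S": []}]),
    3: ("e", [{"_V": "whoop", "_S": ["doo"]}]),
}
_R = [(1, 0, 2, 0), (1, 0, 3, 0), (3, 0, 2, 1)]

def emit(c, r):
    tag, rows = views[c]
    row = rows[r]
    attrs = "".join(' %s="%s"' % (k, v) for k, v in row.items()
                    if k not in ("_V", "_S") and v != "")
    kids = [(cc, rr) for pc, pr, cc, rr in _R if (pc, pr) == (c, r)]
    body = row["_V"]                 # text before the first child
    for i, kid in enumerate(kids):
        body += emit(*kid)           # recurse into each child, in order
        if i < len(row["_S"]):
            body += row["_S"][i]     # text following that child
    return "<%s%s>%s</%s>" % (tag, attrs, body, tag)

print(emit(1, 0))   # the root is always column 1, row 0
```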
There are two basic Tcl scripts included in this demo:
* "analyze.tcl" reports the structure and amount of redundancy
* "reduce.tcl" tries to optimize things by adding "lookup" views
The "analyze" script goes through all views (i.e. element collections),
and collects statistics about all values found. Due to the way mk4xml
works right now, all such values are strings when initially stored. Then,
per view, a list of columns is generated with counts of how many unique
values were found in each, and which value was the most frequent one.
In many cases, "analyze" ought to show considerable redundancy.
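The core of such an analysis pass takes only a few lines of Python; the
view and values below are made up for illustration, not taken from an
actual datafile.

```python
# Per column: count unique values and find the most frequent one,
# the statistics "analyze" reports (toy data, illustrative only).
from collections import Counter

rows = [
    {"city": "Amsterdam", "zip": "1011"},
    {"city": "Amsterdam", "zip": "1012"},
    {"city": "Utrecht",   "zip": "3511"},
    {"city": "Amsterdam", "zip": "1011"},
]

stats = {}
for col in rows[0]:
    counts = Counter(r[col] for r in rows)
    top, freq = counts.most_common(1)[0]
    stats[col] = (len(counts), top, freq)

print(stats["city"])   # (2, 'Amsterdam', 3): plenty of redundancy
```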
The "reduce" script takes this one step further and applies a heuristic
to decide whether it would be "wise" (in some vague way) to recode some
of the string values as int indexes into a separate lookup view. It
then proceeds to create those views, and to restructure the file so
columns with lots of repeated values end up as efficient int indexes.
The resulting view is written to a separate output file, and should in
most cases be smaller than the original. The reason for this is that
repeated strings end up being stored once, while compact int positions
are stored in the original view instead. The output view is in a way
"equivalent" to the original, though it adds an extra level of access
to retrieve actual strings for some fields.
For example, when recoding "name" in "ADDRESSES", the original layout:
ADDRESSES[...,name:S,..]
will be restructured to the following format:
ADDRESSES[...,name_I:I,..],_ADDRESSES_name[s:S]
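The recoding itself can be sketched as follows (the data is made up):
repeated strings are stored once in a lookup list, playing the role of
the _ADDRESSES_name[s:S] view above, and the column itself becomes
small ints.

```python
# Toy version of the "reduce" recoding into a symbol table.
names = ["Amsterdam", "Utrecht", "Amsterdam", "Amsterdam", "Utrecht"]

lookup = []     # plays the role of _ADDRESSES_name[s:S]
index = {}      # string -> its row in the lookup view
name_I = []     # plays the role of the new name_I:I column
for v in names:
    if v not in index:
        index[v] = len(lookup)   # first occurrence: store it once
        lookup.append(v)
    name_I.append(index[v])      # every row keeps only an int position

print(name_I, lookup)
```

Fetching a string back is the "extra level of access" mentioned above:
lookup[name_I[i]] instead of a direct field read.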
One way to summarize this is that "reduce" recodes strings into symbol tables.
The "reduce" script was coded merely as an example of what can be done.
There are many many optimizations one could apply to XML data once it
resides in a MK datafile. Some of these can be automated, either with
heuristic rules for general-purpose use, or by reading the DTD - which
is a prime source of information on the underlying data structure - or by
simply designing a custom structure (as the first XML trial did).
One could also auto-detect the datatype, by not just scanning for the
different values but also checking whether everything is numeric, and
then restructuring into an :I property in MK (careful: you can't change
just the type of a property, you need to come up either with a slightly
different name, or restructure in two steps - see "reduce.tcl").
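That two-step restructuring can be sketched like this; the view, the
property names, and the numeric-detection test are all made up for
illustration.

```python
# Sketch of the two-step type change: detect an all-numeric string
# column and recode it as ints under a temporary name first.
view = {"count:S": ["17", "4", "250"], "name:S": ["x", "y", "z"]}

def recode_numeric(view, prop):
    base = prop.split(":")[0]
    values = view[prop]
    if not all(v.lstrip("-").isdigit() for v in values):
        return                        # not all numeric: leave it alone
    # step 1: materialize the int column under a temporary name,
    # since a property cannot simply change type in place
    view[base + "_tmp:I"] = [int(v) for v in values]
    del view[prop]
    # step 2: restructure once more to give it the intended name
    view[base + ":I"] = view.pop(base + "_tmp:I")

recode_numeric(view, "count:S")
print(view)
```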
For yet another approach to managing this type of data, see Jacob Levy's
"e4Graph", which is built on top of MK but uses different, and carefully
designed data structures to store all sorts of graphs, not just trees.
This code is experimental (yah, well, you know me...). To give it a go,
you may need to compile mk4xml.cpp (a Linux build is included), and then
all it takes is:
mk4xml your_xml_file
The result will be placed in "test.dat".
If you have Tcl and Mk4tcl, or simply a build of TclKit, you can do:
analyze.tcl test.dat
(lots of output)
reduce.tcl -v test.dat
(lots more output, due to the -v)
analyze.tcl reduce.out
(different output this time, i.e. converted results)
That's it for now. I'd like to point out that MetaKit appears very well
suited for a wide range of structured data, including the hierarchical
trees that XML represents. The efficiency of the CatFish disk catalogs,
and the ease with which Tcl's Virtual File System can be implemented
with it, are further indications of this. The key question will be whether one can
come up with a single (adaptive, dynamic, but uniform) structure which
works well for a huge class of XML documents.
All suggestions and comments on this topic are most welcome.
- Jean-Claude Wippler