A second experiment for loading XML into a MetaKit datafile
===========================================================
November, 2001
Coen Siegerink
The first experiment is still at https://www.equi4.com/metakit/xml/.
This version explores a different data format. While the above had a
generic static structure to store elements, attributes, and children,
this design does away with that and creates one view per element type.
The "mk4xml" utility is a C++ program (adapted from the first trial),
which uses James Clark's "Expat" to parse XML and load things into a
MK datafile. It depends pretty heavily on MK's ability to restructure
on the fly, as new elements with different attribute sets are scanned.
In fact, it does something which is quite tricky: it is able to load
all data and *then* fix up the data structure before saving it to file.
For example, take this input:
<a b="c"><d>wow</d><e>whoop<d f="g">dee</d>doo</e></a>
Mk4xml turns this into a MK data structure of the form:
a[_V:S,b:S],d[_V:S,f:S],e[_S[t:S],_V:S]
The result will be:
a[0]: _V = "", b = "c"
d[0]: _V = "wow", f = ""
d[1]: _V = "dee", f = "g"
e[0]: _S[0] = "doo", _V = "whoop"
Comments:
* the "a" view stores the "a" elements, with their "b" attribute
* the "d" view has two rows, missing attributes are empty strings
* the "e" view has a subview to store extra data items (here "doo")
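This loading pass can be sketched in Python (a hypothetical model, not
the actual mk4xml source), using xml.parsers.expat, the standard binding
to the same Expat parser. The _V/_S split rule is inferred from the
example: _V holds the text before the first child, _S the text after
each child.

```python
# Hypothetical model of mk4xml's loading pass, inferred from the example.
import xml.parsers.expat

views = {}   # element type -> list of rows, i.e. one "view" per type
stack = []   # rows of the currently open elements

def start(name, attrs):
    row = {"_V": "", "_S": []}
    row.update(attrs)
    views.setdefault(name, []).append(row)
    stack.append(row)

def chars(data):
    row = stack[-1]
    if row["_S"]:
        row["_S"][-1] += data    # text after a child goes into subview _S
    else:
        row["_V"] += data        # text before any child goes into _V

def end(name):
    stack.pop()
    if stack:
        stack[-1]["_S"].append("")   # open a new text slot after this child

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start
p.CharacterDataHandler = chars
p.EndElementHandler = end
p.Parse('<a b="c"><d>wow</d><e>whoop<d f="g">dee</d>doo</e></a>', True)

for rows in views.values():          # drop empty text slots; the example
    for row in rows:                 # shows none being stored for them
        row["_S"] = [s for s in row["_S"] if s]
```

One simplification: attributes missing from a row simply stay absent
here, whereas MK's restructuring gives every row of a view an (empty)
value for each attribute seen in that element type.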
There's one key ingredient missing in the above: structure. This is
represented by a separate view called "_R", which is of the form:
_R[pc:I,pr:I,c:I,r:I]
The meaning of these fields is:
* pc is the parent column, i.e. the N'th property in the storage
* pr is the parent row, i.e. which row is the parent element
* c is the column of the child entry
* r is the row of the child entry
Entries are in order, so a selection on a specific (pc,pr) pair will
return the set of children of element (pc,pr), in the order they are present.
View _R holds one entry for each non-root element, in the above example:
_R[0]: pc = 1, pr = 0, c = 2, r = 0
_R[1]: pc = 1, pr = 0, c = 3, r = 0
_R[2]: pc = 3, pr = 0, c = 2, r = 1
Since view _R is always first, the above encoding works out as:
* view "a" is column 1
* view "d" is column 2
* view "e" is column 3
As you can see, "a" has 2 children, while "d" has none. Similarly, "d"
appears as a child in two parent elements, while "e" appears in one.
The number of entries of view _R will always be the total number of
elements minus one (since the root is omitted). Also, the root is by
definition always in column 1, row 0 (since that is the first real view).
Some properties of the above design are:
* efficient storage for trees with both low and high fanout
(MK datafiles will usually end up noticeably more compact than XML)
* instant access to specific nodes (identified by a (column,row) pair)
* binary search access speed to locate all children of an element
* linear (but fast/integer) scan to locate the parent of an element
(this might be improved by adding a hash view, perhaps on open)
* it is of course trivial to traverse all elements of a single type
* attribute names are stored once for each element type
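To illustrate the two access-speed claims, here is a small Python
sketch over the example's _R entries: children are located with a
binary search on the sorted (pc, pr) keys, parents with a linear
integer scan.

```python
# Sketch of both lookups over the example's _R entries; _R is kept
# sorted on (pc, pr), which is what makes the binary search valid.
from bisect import bisect_left, bisect_right

_R = [(1, 0, 2, 0), (1, 0, 3, 0), (3, 0, 2, 1)]
keys = [(pc, pr) for pc, pr, c, r in _R]

def children(pc, pr):
    """Binary search: all (c, r) children of (pc, pr), in document order."""
    lo, hi = bisect_left(keys, (pc, pr)), bisect_right(keys, (pc, pr))
    return [(c, r) for _, _, c, r in _R[lo:hi]]

def parent(c, r):
    """Linear integer scan: the (pc, pr) parent of (c, r), if any."""
    for pc, pr, cc, rr in _R:
        if (cc, rr) == (c, r):
            return (pc, pr)
    return None   # the root has no _R entry, hence no parent

print(children(1, 0))   # [(2, 0), (3, 0)]: both children of the root "a"
print(parent(2, 1))     # (3, 0): the second "d" sits inside "e"
```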
Does this make a truly efficient storage format? I don't know yet. But
one of the things MK makes so easy is restructuring on the fly. For
some specific data structures, it will be far more efficient to design
a custom format; the key point is that this can be done after loading.
Another issue which has not been looked into further is how well suited
this format is for modifications. On the other hand, generating plain
XML from this data format ought to be very simple.
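To back up that last claim, here is a sketch which regenerates the XML
of the example from its tables and _R entries. The data is transcribed
by hand from this note; the interleaving rule (child i is followed by
text _S[i]) and the skipping of empty attributes are assumptions, not
confirmed mk4xml behavior.

```python
# Regenerate XML from the example's per-type views plus _R.
views = {
    1: ("a", [{"_V": "", "b": "c", "_S": []}]),
    2: ("d", [{"_V": "wow", "f": "", "_S": []},
              {"_V": "dee", "f": "g", "_S": []}]),
    3: ("e", [{"_V": "whoop", "_S": ["doo"]}]),
}
_R = [(1, 0, 2, 0), (1, 0, 3, 0), (3, 0, 2, 1)]

def emit(c, r):
    tag, rows = views[c]
    row = rows[r]
    attrs = "".join(' %s="%s"' % (k, v) for k, v in row.items()
                    if k not in ("_V", "_S") and v != "")
    kids = [(cc, rr) for pc, pr, cc, rr in _R if (pc, pr) == (c, r)]
    body = row["_V"]                 # text before the first child
    for i, kid in enumerate(kids):
        body += emit(*kid)           # recurse into each child, in order
        if i < len(row["_S"]):
            body += row["_S"][i]     # text following that child
    return "<%s%s>%s</%s>" % (tag, attrs, body, tag)

print(emit(1, 0))   # the root is always column 1, row 0
```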
There are two basic Tcl scripts included in this demo:
* "analyze.tcl" reports the structure and amount of redundancy
* "reduce.tcl" tries to optimize things by adding "lookup" views
The "analyze" script goes through all views (i.e. element collections),
and collects statistics about all values found. Due to the way mk4xml
works right now, all such values are strings when initially stored. Then,
per view, a list of columns is generated with counts of how many unique
values were found in each, and which value was the most frequent one.
In many cases, "analyze" ought to show considerable redundancy.
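The core of such an analysis pass takes only a few lines of Python; the
view and values below are made up for illustration, not taken from an
actual datafile.

```python
# Per column: count unique values and find the most frequent one,
# the statistics "analyze" reports (toy data, illustrative only).
from collections import Counter

rows = [
    {"city": "Amsterdam", "zip": "1011"},
    {"city": "Amsterdam", "zip": "1012"},
    {"city": "Utrecht",   "zip": "3511"},
    {"city": "Amsterdam", "zip": "1011"},
]

stats = {}
for col in rows[0]:
    counts = Counter(r[col] for r in rows)
    top, freq = counts.most_common(1)[0]
    stats[col] = (len(counts), top, freq)

print(stats["city"])   # (2, 'Amsterdam', 3): plenty of redundancy
```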
The "reduce" script takes this one step further and applies a heuristic
to decide whether it would be "wise" (in some vague way) to recode some
of the string values as int indexes into a separate lookup view. It
then proceeds to create those views, and to restructure the file so
columns with lots of repeated values end up as efficient int indexes.
The resulting view is written to a separate output file, and should in
most cases be smaller than the original. The reason for this is that
repeated strings end up being stored once, while compact int positions
are stored in the original view instead. The output view is in a way
"equivalent" to the original, though it adds an extra level of access
to retrieve actual strings for some fields.
For example, when recoding "name" in "ADDRESSES", the original layout:
ADDRESSES[...,name:S,..]
will be restructured to the following format:
ADDRESSES[...,name_I:I,..],_ADDRESSES_name[s:S]
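The recoding itself can be sketched as follows (the data is made up):
repeated strings are stored once in a lookup list, playing the role of
the _ADDRESSES_name[s:S] view above, and the column itself becomes
small ints.

```python
# Toy version of the "reduce" recoding into a symbol table.
names = ["Amsterdam", "Utrecht", "Amsterdam", "Amsterdam", "Utrecht"]

lookup = []     # plays the role of _ADDRESSES_name[s:S]
index = {}      # string -> its row in the lookup view
name_I = []     # plays the role of the new name_I:I column
for v in names:
    if v not in index:
        index[v] = len(lookup)   # first occurrence: store it once
        lookup.append(v)
    name_I.append(index[v])      # every row keeps only an int position

print(name_I, lookup)
```

Fetching a string back is the "extra level of access" mentioned above:
lookup[name_I[i]] instead of a direct field read.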
One way to summarize this is that "reduce" recodes strings into symbol tables.
The "reduce" script was coded merely as an example of what can be done.
There are many many optimizations one could apply to XML data once it
resides in a MK datafile. Some of these can be automated, either with
heuristic rules for general-purpose use, or by reading the DTD - which
is a prime source of information on the underlying data structure - or by
simply designing a custom structure (as the first XML trial did).
One could also auto-detect the datatype, by not just scanning for the
different values but also checking whether everything is numeric, and
then restructuring into an :I property in MK (careful: you can't change
just the type of a property, you need to come up either with a slightly
different name, or restructure in two steps - see "reduce.tcl").
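That two-step restructuring can be sketched like this; the view, the
property names, and the numeric-detection test are all made up for
illustration.

```python
# Sketch of the two-step type change: detect an all-numeric string
# column and recode it as ints under a temporary name first.
view = {"count:S": ["17", "4", "250"], "name:S": ["x", "y", "z"]}

def recode_numeric(view, prop):
    base = prop.split(":")[0]
    values = view[prop]
    if not all(v.lstrip("-").isdigit() for v in values):
        return                        # not all numeric: leave it alone
    # step 1: materialize the int column under a temporary name,
    # since a property cannot simply change type in place
    view[base + "_tmp:I"] = [int(v) for v in values]
    del view[prop]
    # step 2: restructure once more to give it the intended name
    view[base + ":I"] = view.pop(base + "_tmp:I")

recode_numeric(view, "count:S")
print(view)
```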
For yet another approach to managing this type of data, see Jacob Levy's
"e4Graph", which is built on top of MK but uses different, and carefully
designed data structures to store all sorts of graphs, not just trees.
This code is experimental (yah, well, you know me...). To give it a go,
you may need to compile mk4xml.cpp (a Linux build is included), and then
all it takes is:
mk4xml your_xml_file
The result will be placed in "test.dat".
If you have Tcl and Mk4tcl, or simply a build of TclKit, you can do:
analyze.tcl test.dat
(lots of output)
reduce.tcl -v test.dat
(lots more output, due to the -v)
analyze.tcl reduce.out
(different output this time, i.e. converted results)
That's it for now. I'd like to point out that MetaKit appears very well
suited for a wide range of structured data, including the hierarchical
trees that XML represents. The efficiency of the CatFish disk catalogs,
and the ease with which Tcl's Virtual File System can be implemented
with it, are further indications of this. The key question will be whether one can
come up with a single (adaptive, dynamic, but uniform) structure which
works well for a huge class of XML documents.
All suggestions and comments on this topic are most welcome.
- Jean-Claude Wippler