From: Steven M. Kearns <skearns@hotmail.com> - 16 Feb 1999
I have a bag of keyword->id mappings that I need to store, and retrieve by keyword. Note that this is a bag, in that there are multiple ids for one keyword.
Ok. If there are *many* ID's, a subview might be more suitable, i.e.:
items[keyword:S,ids[value:I]]
This stores keywords only once, but I wouldn't do it if there are just a few ID's per keyword.
--> I think you would suggest making a View with properties "keyword" and "id" and then creating a sorted view on "keyword".
Yes :)
My question is: I want to be able to store the sorted view on disk, so that it does not have to be recomputed every time. But I still want it to be updated when I delete or add items to the underlying view.
Well... derived views cannot persist. Two comments:
- If you are worried by startup time for setting up the sorted view,
all I can say is: try it. You might be surprised, it might be ok.
- The other way to do this is to keep this sorted yourself. In that
case, Metakit becomes a persistent vector manager for you, and you
would have to do the same as you would with any other C++ container.
In addition, I need to iterate through the sorted derived view, starting with the first item >= k. However, it appears that all of the view member functions that I saw require an exact match to k, instead of the first item >= k.
By "k" I assume you mean a keyword value, not an integer row index.
There's no function for that, you're right. The derived "select" view will deal with ranges such as k..max, but it's not efficient to set it up each time, i.e. if "k" is different every time.
I suppose I could do a binary search on the sorted derived view....
Have a look at c4_View::Search - it does binary search for you. It returns the closest match (the index of the first key >= k, I think). It works by *assuming* the view is properly sorted (derived or not).
Finally, can you give any performance guidelines, a la "a commit requires O(log N)*M steps, where M is the number of inserts or deletes you did, and N is the total number of items in the view."
Performance is very, very hard to predict in Metakit. Anything from jaw-dropping to total despair is possible...
Commit is currently related to the *total* number of subviews and the number and size of those views which have been modified. The number of inserts/deletes in a single view has almost no influence - this is caused by the columnar storage format used inside Metakit. It is a fact that doing a commit after each row change is rather expensive. On the other hand, incrementing all IDs by one and committing would be fast.
I can only repeat: performance in Metakit is often counter-intuitive. A test (two years ago) comparing sort performance with MS Jet placed Metakit at 30 times as fast in one case. There are enough tricks inside (such as using a roving gap to greatly reduce data movement in changes affecting adjacent rows), to make it nearly impossible to give general guidelines. I understand that this is not a very useful answer, but it's the best I can up with. You'll have to do a few timing tests.
The only other comment I can add is: be sure to try simple, brute force, approaches before starting to come up with smart solutions. On a Pentium II / 400, linear string scans can exceed 500,000 compares /sec.
-- JC
1999-06-02
(local)
Note: you are looking at
the snapshot of an old wiki
- much of this information
is likely to be very outdated
