Re: offset-based hash table for ASCII data
On 20.04.2008 20:36, Rex Mottram wrote:
Robert Klemme wrote:
I tried to find a post of yours in the thread where you explain what
kind of data structure you need but could not find one. The term
"offset based hash table for ASCII data" is far too generic for me to
get a concrete idea. Can you elaborate a bit or show a sample XML if
it is not too large?
I don't know how many other people will be interested in reading to this
depth but for the record, here's a (long) synopsis.
Thanks for elaborating!
There are three
elemental concepts: Activities, States, and Transactions. Activities are
operations that read some States and modify others. Every time an
Activity takes place, all affected States are logged (in a server-side
database) as a Transaction.
By "affected states" do you mean the states prior to the Activity, after
it, or both? And is the transaction list you transfer to clients always
the complete history?
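For my own understanding, here is a rough sketch of the model as I
picture it (plain Java; all names are invented by me, not taken from
your schema):

// The three concepts, roughly as I picture them (names invented by me).
import java.util.List;

class State {
    String id;
    String value;              // whatever gets compared against "reality"
}

class Activity {
    String id;                 // reads some States, modifies others
}

class Transaction {            // one logged run of an Activity
    String id;
    Activity activity;
    List<State> affectedStates;
}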
Now, these Activities can be pretty slow/expensive to run. Therefore we
look for opportunities to skip them. Specifically, if a previous
Transaction was run on the identical set of input States, we can skip
the new Activity and instead log a pointer back to the previous
Transaction. This is a _really_ big win when possible.
So you need a quick match based on (activity, (input state 1, ..., input
state N)), is that right?
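I.e., something along these lines is what I picture (a sketch only; the
composite key layout is my assumption):

import java.util.Arrays;

// Hypothetical composite key: the activity plus the ids of its input
// states. Sorting makes the key independent of document order.
final class MatchKey {
    private final String activityId;
    private final String[] inputStateIds;

    MatchKey(String activityId, String[] inputStateIds) {
        this.activityId = activityId;
        this.inputStateIds = inputStateIds.clone();
        Arrays.sort(this.inputStateIds);
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof MatchKey)) return false;
        MatchKey k = (MatchKey) o;
        return activityId.equals(k.activityId)
            && Arrays.equals(inputStateIds, k.inputStateIds);
    }

    @Override public int hashCode() {
        return 31 * activityId.hashCode() + Arrays.hashCode(inputStateIds);
    }
}

// Usage: Map<MatchKey, String> previousTx = new HashMap<MatchKey, String>();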
Since XML must be parsed sequentially, and since we felt a DOM model would
be too expensive, we went to some lengths to define a schema that could
accommodate the above logic in a single (SAX) pass. The result was
something like the following three stanzas (naturally this will have lots
of angle brackets, and I can't promise it will render correctly in HTML
readers):
<transactions>
<tx id="1" ... />
<tx id="2" ... />
.
.
.
</transactions>
You mention that transactions involve state but I do not see it here.
<activities>
<activity id="1" ... />
<activity id="2" ... />
.
.
.
</activities>
<states>
<state id="1" ... />
<state id="2" ... />
.
.
.
</states>
This leaves out many details but the basic structure is fairly clear.
The client code parses through the Transaction section, storing a
reference to each transaction in a local hash table known as the TX
table. Then it walks through the set of activities, doing a
(necessarily) linear search for the current Activity.
How do you determine from the structure above which is the current one?
There must be something I am missing.
Once that is
matched it sets a "found" flag and chews through the rest of the
Activity stanza until reaching the States stanza.
Next, each State is compared vs current reality.
So you have an additional input, current state(s)?
Whenever a State
mismatch is encountered, the Transactions associated with the mismatched
State are deleted from the TX table. If we reach the end of the States
and the TX table is not empty, what remains are matches. If at any
intermediate point we find that the TX table has become empty, we can
abort the parse and just go run the Activity.
In other words: if the TX table is non-empty after looking at all states,
you do not need to run the activity (and save a lot of time).
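The whole elimination pass might then look roughly like this (only a
sketch of how I picture it: I left out the activity-matching step, and
the attribute names, e.g. a "txs" attribute on <state> listing its
associated transactions, are my guesses):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Single-pass elimination: fill the TX table from the <transactions>
// stanza, then prune it while walking the <states> stanza.
class EliminationHandler extends DefaultHandler {

    // Thrown to abort the parse as soon as no candidate is left.
    static class NoMatchLeft extends SAXException {
        NoMatchLeft() { super("TX table empty - run the activity"); }
    }

    private final Map<String, String> reality;  // current state id -> value
    private final Set<String> txTable = new HashSet<String>();

    EliminationHandler(Map<String, String> reality) {
        this.reality = reality;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if ("tx".equals(qName)) {
            txTable.add(atts.getValue("id"));
        } else if ("state".equals(qName)) {
            // Compare the recorded state with current reality; on mismatch,
            // drop every transaction associated with this state.
            String actual = reality.get(atts.getValue("id"));
            if (!atts.getValue("value").equals(actual)) {
                for (String tx : atts.getValue("txs").split(","))
                    txTable.remove(tx);
                if (txTable.isEmpty())
                    throw new NoMatchLeft();  // early abort of the parse
            }
        }
    }

    Set<String> survivors() { return txTable; }  // non-empty => reuse a match
}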
What makes this preferable to a DOM design is the opportunity to abort
the parse as soon as potential matches are exhausted; DOM would have
required a full parse followed by some memory accesses.
I find DOM rarely the best choice. Even for other types of XML
processing (those that do not involve an early exit), SAX is usually the
much better choice: faster and far easier on memory.
This is the current model and it works quite stably, but unfortunately
not fast enough. XML parsing proves to be the dominant time cost. Since
the Transaction stanza is small relative to the other two, which can
grow quite large, it seems quite likely that it's the linear search
aspect of the latter two stanzas that's the problem.
Thus the key requirement for the "XML replacement technology" is that it
be random access. A hash table is the obvious candidate, and that's
exactly what CDB turned out to be, but a tree-based structure such as a
red-black tree might have been fine too. Anyway, CDB turned out to be
everything I was looking for.
And apparently you solved the encoding issue. That's good to hear.
(reading up on CDB)
My recommendations would have gone in a similar direction: define a
binary format that satisfies your lookup requirements, and create
read/write code in Java plus read code in C. If the files are small
enough you can memory map them; otherwise you have to use indexing
techniques, as CDB does. But since you found your solution we don't
need to waste more brain cycles.
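For completeness, the memory-mapped route I had in mind would look
roughly like this: an untested sketch of a cdb-style lookup, based on my
reading of the published format (a constant header of 256 (offset, slot
count) pairs, then hash tables of (hash, record offset) slots, all
little-endian 32-bit fields; it also assumes the file stays below 2 GB
so int offsets suffice):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Arrays;

// Memory-mapped, cdb-style lookup sketch (class name is mine).
class CdbSketch {
    private final MappedByteBuffer map;

    CdbSketch(String file) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        map = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0,
                                   raf.length());
        map.order(ByteOrder.LITTLE_ENDIAN);
        raf.close();  // the mapping stays valid after the channel is closed
    }

    // cdb's hash function: h = 5381; h = ((h << 5) + h) ^ byte
    private static int hash(byte[] key) {
        int h = 5381;
        for (byte b : key) h = ((h << 5) + h) ^ (b & 0xff);
        return h;
    }

    byte[] find(byte[] key) {
        int h = hash(key);
        int headerSlot = (h & 255) * 8;        // one of 256 header entries
        int tablePos = map.getInt(headerSlot);
        int slotCount = map.getInt(headerSlot + 4);
        if (slotCount == 0) return null;

        int start = (h >>> 8) % slotCount;
        for (int i = 0; i < slotCount; i++) {  // linear probing
            int slot = tablePos + 8 * ((start + i) % slotCount);
            int slotHash = map.getInt(slot);
            int recPos = map.getInt(slot + 4);
            if (recPos == 0) return null;      // empty slot: key not present
            if (slotHash != h) continue;
            int klen = map.getInt(recPos);
            int dlen = map.getInt(recPos + 4);
            if (klen != key.length) continue;
            // Use a duplicate so the shared buffer's position is untouched;
            // byte-array reads below do not depend on the byte order.
            ByteBuffer view = map.duplicate();
            view.position(recPos + 8);
            byte[] k = new byte[klen];
            view.get(k);
            if (!Arrays.equals(k, key)) continue;
            byte[] data = new byte[dlen];
            view.get(data);
            return data;
        }
        return null;
    }
}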
Thanks again. And please let us know the outcome (i.e., the speed
improvement) and any issues you find along the way. That way we can all
learn a bit. :-)
Kind regards
robert