Book logo xindy

A Flexible Indexing System


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Sorting marked-up entries




	Hi Chris and all others,

It took me quite a while to think about all what Chris wrote in his
mail. I was still unsatisfied by discussing user interface issues
before I had not understand the real sorting problem. For this reason
I started *not* to think about the user interface at the first place
and think about the theoretical foundations of our problem instead.

The first questions I came across were "What does mark-up actually mean
in our problem domain?", "What are possible mark-ups?" and "How do
they influence the sorting process?". I'll give a rather complex,
though real-live example of what I'm thinking of.

We have to process an author index containing names of authors from
all over the world. There may appear Chinese authors written in their
native language as well as in the Latin-like pronounciation speech
(don't know how to say in English). Imagine the Latin-like written
names appear between the other authors and the Chinese written symbols
appear at the end of the index in a separate list of Chinese authors
(which may only of interest for all people capable of reading
Chinese). The index processor must understand that an authors name is
Chinese, a natural application of a mark-up named (lang=chinese).

Chinese names are sometimes sorted based on the number of strokes that
appear in the corresponding Chinese symbol. The number of strokes is
then an additional structural mark-up. A Chinese name in our example
index may then be represented as

	(lang=chinese (strokes=15 "Liang"))

This notation I use here represents mark-up as a named node in a tree
with arbitrary many children. I'll come back to this issue later.

Another problem that arises in this context are for example, Spanish
authors. As we know from the discussion papers, the collating rules
for Spanish are different than for other languages (ie. "Ch", "Ll").
Hence, structural mark-up may also influence the collation rules of a
keyword. This is a completely new aspect in our discussion since we
have not yet considered mark-up influencing collation rules. In fact
this is a problem that has to be solved in the parsing phase of the
keyword in my opinion.

I'd say that a keyword is a tree-based recursive data structure
defined as follows:

A keyword is
 (a) a mark-up entity consisting of an identifier containing a
     sequence of keywords as its childs, or
 (b) a word, i.e. a sequence of letters.

A letter is something that has been obtained by collation rules
mapping a sequence of characters into a letter.

A possible keyword is (denoted in a list-like style):

  (no-mark-up
	"This"
	(emph "is an")
	"example"
	(bold "keyword")
  )

Here we could declare that "example" and (no-mark-up "example") are
essentially the same.

Coming back to out initial example we must sort the Chinese authors to
the very end of the index. Thus, the first sort phase (I'll stay with
this concept since it is intuitively clear from the users'
perspective) separates Chinese from non-Chinese authors. Here we have
an example for which there is a need to start sorting based on mark-up
and not on letters.

In a further sorting phase the Chinese authors must be sorted
according to the number of strokes of their symbols. In fact this
sorting is meaningless for non-Chinese keywords. Thus, the set of
meaningful sorting rules may depend on the mark-up as well, introducing
another new aspect.

I think, that these introductory examples give some idea about what
structural mark-up can be used for. It is essentially some kind of
meta-information that can be used to sort keywords.

If we agree on the view that a keyword can be represented as a tree
with letters at its leaves we see several similarities with other
application domains. Imagine the possible mark-up schemes are not
arbitrarily (lang=chinese may also appear within a strokes=15
element), we are directly lead to SGML and its DTDs and instances. My
question is, if there is a grammar on how the mark-up must be used in a
keyword, in what sense does a keyword differ from an instance of a
SGLM DTD? In other terms, what is the difference between a keyword and
a structured document? Currently, I don't see any difference. The
question of sorting a keyword is essentially the same question as
asking which one of two SGML instances is sorted before the other. I
do not address the problem of character encodings or other SGML
peculiarities, my intent is only to ask the question "How are
structured documents sorted?". And this question seems to be the
essence of all our problems.

The problem has turned now into the question, how do we sort trees,
according to the above outlined notion of tree. From the intuitive
point of view, sorting is a process done in several phases. We compare
according to some criteria and then obtain some elements that are
equal in this sense. We continue to sort according to a new criteria
until we finally obtain a total order on all elements.

If we want to sort lexicographically according to only *one* sequence
of something (i.e. numbers) we need to map a tree (in our sense) into
a sequence. How can this be accomplished? A first observation shows
that the sorting criteria for each sorting phase is something that can
be obtained by extracting and filtering certain information from the
tree and serializing this information. For example we can introduce
some kind of tree manipulation operations that remove nodes from the
tree and exchange subtrees. Exchanging subtrees could then be used to
reverse sequences (needed for French sorting rules) and so on. Thus a
tree-rewriting system can be used to transform a tree into a sequence
(i.e a root-node with a sequence of atomic values). My notion of
atomic value is not restricted to letters only--a node containing the
name of a mark-up node suffices as well (I start to write more and more
informal, since this is something I'm not yet sure about it). The idea
is to map a tree to a sequence. And to map several successive sorting
phases to several sequences that when concatenated (with some
delimiting element) would obtain a total order on the trees.

The current approach implemented in xindy is essentially that of a
string-rewriting system, that to some extend can be used to rewrite a
string, but which does not deal with structural information in a
broader sense. Additionally it does not deal with an object-oriented
view on letters as attributed objects (cf. define-letter from previous
mails). What we need is a generalization of this principle in some
form, I currently have no idea of. In fact, the attributes invented in
the define-letter declaration can all be represented as mark-up nodes
in the tree as well. There is no difference between these two views in
my opinion.

In fact, to me it seems to be hard to find semi-solutions to this
problem that address only some of the problems but not all of them.

Another aspect is that the sorting process may be structured
differently. It can be described in terms of an acyclic graph having
as is edges the specfication of the sorting process that has to be
applied.

	          lang=chinese
	o-------------->o-----+----->o strokes=1
        !		      +----->o strokes=2
        ! lang=others	      +...
	+-------------->o
			....

This is an idea that came into my mind just when I was typing. I'm not
yet sure about this aspect. But it may be a natural way of defining
sorting processes and reusing paths in the graph, which seems to be
quite useful in practice. Here another problem occurs. Are categories or
enumerations of attributes still useful as it was introduced in the
define-letter declaration?


I have not directly answered Chris' mail but I think that this mail
addresses almost all of the problems described yet. My intent is to
first open discussion for this view on sorting processes before we
continue to think about possible user interfaces. From my personal
point of view I have now better understood what the real problem and
its complexity is (I hope so:). We find ourself now quite far away
from where we initially started.


Thanks for your patience reading this :)

Bye Roger


P.S.: I'd like to thank Gabor, whom I tried to convince that keywords
      are structured documents and who did helpful comments on my ideas.
-- 
======================================================================
Roger Kehr			   kehr@iti.informatik.th-darmstadt.de
Computer Science Department          Technical University of Darmstadt