xindy

A Flexible Indexing System


Re: Sorting marked-up entries



Roger (and everyone),

I agree with your description of keywords as trees, and that they can
therefore be described as instances of SGML DTDs.  So let us see where
we can go from that analysis.

To be more specific, all the keywords to be sorted are instances of
one DTD.

You then ask:

"How are structured documents sorted?"

To which the answer is that, in the normal everyday sense of users of
structured documents, collections of structured documents are not
sorted; this is of course not strictly true, but when they are sorted
(e.g. in a catalogue of documents) only a small part (or none) of the
full document is needed.
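
A minimal sketch of the catalogue case, assuming a toy tree representation
of a document. The `Node` class, the tag names "doc", "title" and "body",
and the helper names are all hypothetical illustrations, not part of SGML
or of xindy. The point is only that the sort key is a projection of a
small part of the tree:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a tag, optional text, and child nodes."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def first_text(node, tag):
    """Return the text of the first node with the given tag, or ''."""
    if node.tag == tag:
        return node.text
    for child in node.children:
        found = first_text(child, tag)
        if found:
            return found
    return ""


def catalogue_key(doc):
    """Sort on the title only; the rest of the document is ignored."""
    return first_text(doc, "title").casefold()


docs = [
    Node("doc", children=[Node("title", "Zebra habitats"),
                          Node("body", "Aardvarks are mentioned first here.")]),
    Node("doc", children=[Node("title", "antelope migration"),
                          Node("body", "Zoology notes.")]),
]

catalogue = sorted(docs, key=catalogue_key)
titles = [first_text(d, "title") for d in catalogue]
```

The body text plays no part in the ordering, which is what makes this
case so much simpler than sorting whole keywords.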

But we need to sort keywords.  However, it does not necessarily follow
that we need a theory of sorting structured documents.  There are two
ways in which our problem may differ from this general theory:

1.  We may not need to deal with any possible DTD.  SGML allows very
complex DTDs and it could be that the DTDs needed for the structure of
keywords form a subclass.  This could make our problem easier.

2.  We are not (at least to begin with) interested in general
orderings of keywords but only in ones that make sense for a
"human-usable index".  To illustrate what I mean here, suppose that we
were not aiming at "human-usability" but were only ordering for
machine readability: we are then in familiar territory where there is a
well-developed theory (hash tables etc.) but it is useless for our
purposes.  Restricting ourselves to human-usable orderings could make
the problem easier, but I suspect that it makes it harder: for a start
we need to know what "human-usable" means for an index (and, of course,
eventually we need to consider the usability of xindy to produce such
an index).
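
To make the contrast in point 2 concrete, here is a small sketch that
compares a purely machine-oriented order (raw code points) with one
possible guess at a human-usable order. The rule used in `human_key`
(ignore accents and case first, then fall back to the original spelling)
is an assumption chosen for illustration; it is not a definition of
"human-usable" and not how xindy sorts:

```python
import unicodedata


def human_key(word):
    """Illustrative key: compare base letters without accents or case,
    then use the original spelling to break ties."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)


# "\u00e9" is e with acute accent, written as an escape to keep the
# source encoding-independent.
words = ["zoo", "Zebra", "\u00e9migr\u00e9", "apple"]

machine_order = sorted(words)               # ordered by code point
human_order = sorted(words, key=human_key)  # ordered by the key above
```

The code-point order puts "Zebra" before "apple" and the accented word
after "zoo"; the keyed order gives apple, émigré, Zebra, zoo. Both are
perfectly good for a machine, and only one is of any use to a reader.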

Thus, whilst you have not answered my questions, you have convinced me
that they are still worth asking even if we need to ask more general
questions later.  But maybe they should be put into the context of
first deciding what sort of orderings are useful for human-readable
indexes?

Maybe I am saying that we need to address exactly what you find hard:

> In fact, to me it seems to be hard to find semi-solutions to this
> problem that address only some of the problems but not all of them.

and I am suggesting one way (I hope there are others) to start down
this hard road.

I think that you will have to expand this bit before I can comment
further:

> Another aspect is that the sorting process may be structured
> differently. It can be described in terms of an acyclic graph having
> as its edges the specification of the sorting process that has to be
> applied.
>
>                   lang=chinese
>         o-------------->o-----+----->o strokes=1
>         !                     +----->o strokes=2
>         ! lang=others         +...
>         +-------------->o
>                         ....
>
> This is an idea that came into my mind just when I was typing. I'm not
> yet sure about this aspect. But it may be a natural way of defining
> sorting processes and reusing paths in the graph, which seems to be
> quite useful in practice. Here another problem occurs. Are categories
> or enumerations of attributes still useful as they were introduced in
> the define-letter declaration?

It sounds very interesting, but I am unsure what the constituents of
your acyclic graphs are.
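
One possible reading of the diagram, offered only as a sketch to make the
question concrete: each edge carries a test on an attribute of the entry,
and the path an entry takes through the graph becomes the leading part of
its sort key. Everything here is an assumption made for illustration: the
node names, the use of `None` to stand for "lang=others", and the
dictionary layout of an entry are hypothetical and are not taken from the
quoted proposal or from xindy:

```python
# Hypothetical graph: node -> ordered list of (attribute, value, target).
# A value of None matches anything and plays the role of "lang=others".
GRAPH = {
    "start": [("lang", "chinese", "cjk"), ("lang", None, "other")],
    "cjk": [("strokes", 1, "s1"), ("strokes", 2, "s2")],
}


def path_key(entry, graph=GRAPH, node="start"):
    """Walk the graph, recording the position of each edge taken.
    The recorded path, followed by the entry text, is the sort key."""
    path = []
    while graph.get(node):
        for position, (attr, value, target) in enumerate(graph[node]):
            if value is None or entry.get(attr) == value:
                path.append(position)
                node = target
                break
        else:
            break  # no edge matches this entry: stop walking
    return (tuple(path), entry["text"])


entries = [
    {"text": "beta", "lang": "english"},
    {"text": "二", "lang": "chinese", "strokes": 2},
    {"text": "一", "lang": "chinese", "strokes": 1},
    {"text": "alpha", "lang": "english"},
]

ordered = [e["text"] for e in sorted(entries, key=path_key)]
```

Under this reading the edges are attribute tests and the nodes are merely
junctions, so a subgraph such as the stroke-count branch could be reused
from several places. Whether that is what was intended is exactly the
open question.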

I had thought about bringing in the sorting of ideographs (there are
various methods) but I decided to stick with alphabetical words since
I have a better intuition as to what is needed in practice for these.

It would certainly be useful to ask people who know about
non-alphabetic writing systems whether the methods for sorting them do
introduce yet more concepts that we have not considered: I am not sure
whether I hope they do or hope they do not :-).

Is there anyone out there listening who can help us here?

Best wishes


chris