Re: Discussion about international sorting order

To: xindy@iti.informatik.th-darmstadt.de
Subject: Re: Discussion about international sorting order
From: Roger Kehr <kehr@iti.informatik.th-darmstadt.de>
Date: Thu, 20 Mar 1997 15:01:11 +0100 (MEZ)
In-Reply-To: <199703191803.SAA05071@fell.open.ac.uk> from "Chris Rowley" at Mar 19, 97 07:04:44 pm
Reply-To: xindy@iti.informatik.th-darmstadt.de
Sender: owner-xindy@iti.informatik.th-darmstadt.de


Chris wrote:

> 1.  This concept of attaching properties to letters (or rather defining
> letters to be such objects) seems to me a good solution to the
> following common (in many senses) requirement of sorting real
> languages:
>
>   a multi-level sort in which levels are defined by such things as:
>
>     case; accents; other diacritics
>
> One small point about the name used: `partial-order' implies a
> different structure.  These orders are total orders on different sets,
> so I would call them sub-orders.

That's correct. I already noticed this yesterday.

> One practical point about the example:  I think that the :accent
> property and "sub-ordering" needs to include the value `none'.

Yes. I have not given a completely correct specification. I just
wanted to present the main ideas in a few paragraphs.

> One question:  you put this into your define-total-order example:
>
>    (:accent backwards)
>
> but you did not specify what values are allowed instead of backwards
> (or what it means, but I think I can work that out:-).

The ISO standard specifies `forward', `backward' and `position'.
`forward' simply says that we sort according to a lexicographic
order from left to right, `backward' correspondingly from right to
left (necessary for the French sorting rules, for example). `position'
deals with letters that are ignored in a lexicographic comparison
phase. The ISO standard allows letters to be ignored in a run, for
example to remove special characters (the `-' in the following
example) from the keyword.

--- snip
4. The fourth decomposition breaks the final tie that does not
correspond to any tradition, the tie due to quasi-homographs that
differ only because they contain special characters. Breaking this tie
is essential to ensure the absolute predictability of sorts and also
to be able to sort strings composed only of special characters. Since
the traces of special characters were removed from the original data
to form the three first orders of decomposition, simply putting them
in row in the fourth order of decomposition would mean that their
position would be lost. These positions are quite important to solve
remaining ties and in consequence we must retain here the original
positions of these special characters: two quasi-homographs could each
contain a common special character in different positions and thus be
strictly different (ex.:"ab*cd" is still different from "a*bcd"
despite they share one and only one common special character).

Example: to have the following order: "coop", "co-op", "coop-" numbers
could be assigned respectively according to the following pattern:
"d", "d3-" and "d5-", where "d" is an always-present delimiter that
separates this decomposition from the first three in case all four
decompositions are to be concatenated to form a single sorting key
based on numeric values (see discussion in the next paragraph). "3-"
means a dash in position 3 of the original string. "5-" means a dash
in position 5, and so on.
--- snap

This essentially makes "coop", "co-op" and "coop-" compare `equal' in
the first runs, while `position' allows the tie to be broken in the
next phase.
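
As a minimal sketch of this `position' idea, written in Python rather
than in any xindy syntax: the names SPECIAL, strip_special and
position_key are hypothetical, and the set of special characters is an
assumption made only for this example.

```python
# Assumed set of ignorable special characters (illustration only).
SPECIAL = set("-*")

def strip_special(word):
    """Key for the first runs: the word with special characters removed."""
    return "".join(ch for ch in word if ch not in SPECIAL)

def position_key(word):
    """Key for the tie-breaking run: (position, character) pairs for each
    special character, with positions counted from 1 as in the quoted text."""
    return [(i, ch) for i, ch in enumerate(word, start=1) if ch in SPECIAL]

def sort_key(word):
    # The stripped word decides first; positions only break remaining ties.
    return (strip_special(word), position_key(word))

words = ["coop-", "co-op", "coop"]
print(sorted(words, key=sort_key))
```

Here "coop" has no special characters, "co-op" has a dash in position
3 and "coop-" has one in position 5, which corresponds to the "d",
"d3-" and "d5-" pattern in the quoted text.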

I don't know enough other languages to say whether further processing
rules are needed for a sub-order. But it seems that this scheme is
enough at least for all European languages.
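
To illustrate the difference between `forward' and `backward' for an
accent sub-order, here is a small Python sketch (again not xindy
syntax). The table LETTERS, the ranking in ACCENT_RANK and the function
names are assumptions made up for this example; the word list is the
usual French one.

```python
import unicodedata

# (base letter, accent) for the accented characters used below.
LETTERS = {
    "\u00f4": ("o", "circumflex"),  # o with circumflex
    "\u00e9": ("e", "acute"),       # e with acute
}
# Assumed sub-order of the accent property, including the value "none".
ACCENT_RANK = {"none": 0, "acute": 1, "circumflex": 2}

def decompose(word):
    """Turn a string into a list of (base letter, accent) pairs."""
    word = unicodedata.normalize("NFC", word)
    return [LETTERS.get(ch, (ch, "none")) for ch in word]

def sort_key(word, accent_direction):
    letters = decompose(word)
    base = [b for b, _ in letters]
    accents = [ACCENT_RANK[a] for _, a in letters]
    if accent_direction == "backward":
        accents.reverse()  # compare accents from right to left
    return (base, accents)

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=lambda w: sort_key(w, "forward"))
backward = sorted(words, key=lambda w: sort_key(w, "backward"))
print(forward)
print(backward)
```

With `forward' the accent nearest the beginning of the word decides
first; with `backward' the accent nearest the end does, which gives
the French order cote, côte, coté, côté.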


> 2.  It is not clear to me that this approach will directly support
> other common requirements, such as the sub-orders required in sorting
> German, so that if  u-umlaut  and  ue  have been merged at the top-level,
> the order is defined for two words that are identical except that one
> has  u-umlaut  whereas the other has  ue .
>
> This could be done with yet another property of the letter class, called
>
>   :umlaut-oder-e
>
> having values:
>
>   umlaut  e  none  irrelevant
>
> (the last being used for all letters that never take umlauts) but such
> an approach begins to get messy.

This is still an open problem for which I have not yet found a
satisfactory solution. One possibility would be to map

	\"u  ->  <u> <e (:umlaut-oder-e umlaut)>
	u    ->  <u>

and comparison of

	<u> <e>
	<u> <e (:umlaut-oder-e umlaut)>

would yield the correct order. But as you say, this is really messy.
The real problem is that it does not fit well into the
letter-by-letter comparison approach.
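
A rough Python sketch of this mapping, only to make the idea concrete:
the Letter class, the RANK table and the function names are
hypothetical, and the choice that plain "ue" sorts before u-umlaut is
an assumption, not something fixed above.

```python
import unicodedata
from dataclasses import dataclass

@dataclass(frozen=True)
class Letter:
    base: str
    umlaut_oder_e: str = "none"  # property values used here: "none", "umlaut"

def to_letters(word):
    """Map u-umlaut to <u> <e (:umlaut-oder-e umlaut)>, anything else to itself."""
    letters = []
    for ch in unicodedata.normalize("NFC", word):
        if ch == "\u00fc":  # u-umlaut
            letters += [Letter("u"), Letter("e", "umlaut")]
        else:
            letters.append(Letter(ch))
    return letters

# Assumed sub-order of the property: a plain e before an e that came from an umlaut.
RANK = {"none": 0, "umlaut": 1}

def sort_key(word):
    letters = to_letters(word)
    top_level = [l.base for l in letters]                  # ue and u-umlaut merged
    sub_order = [RANK[l.umlaut_oder_e] for l in letters]   # breaks the tie
    return (top_level, sub_order)

words = ["muller", "müller", "mueller"]
print(sorted(words, key=sort_key))
```

Note that the tie is only broken after the whole base-letter sequence
has been compared, which is the point where this stops being a plain
letter-by-letter comparison.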


> 3.  You probably expect me to come up with a general solution...well I
> guess the counter to that is some questions:
>
>   is there a special collection of merge-rules that come from
>   real-world multi-level sorting rules?
>
>   do these lead to a reasonable collection of letter-properties that
>   need to be added to support specification of these rules?

Good questions. I'll think about it. At the moment I do not have
enough experience with other languages to be able to discover a more
general approach.


> 4.
> > Still missing is a appropriate mapping that transforms a string (a
> > sequence of chars) into a sequence of letters (which have become real
> > objects now).
> >
> > This could look like:
> >
> > 	(define-mapping "umlaut-u" ("\~"u" "ü"))
> > 	(define-mapping "umlaut-A" ("\~"A" "Ä"))
> >
> > [I hope you can see the ISO-Latin chars as well]
> Well, I can see \374 in my emacs, will that do?:-)
>
> But I do not understand the syntax you are using here.

It defines a mapping from

	\~"u			->  "umlaut-u"
	ü (ISO Lat. char)	->  "umlaut-u"
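
In Python terms (not xindy syntax) such a mapping could be sketched as
a table from input spellings to letter names, applied with a
longest-match scan. The names MAPPINGS, define_mapping and to_letters
are hypothetical, and the TeX spelling is assumed to be \"u as in the
umlaut example earlier in this message.

```python
# Table from input spelling to letter name (hypothetical structure).
MAPPINGS = {}

def define_mapping(letter_name, sources):
    """Register every spelling in `sources` as the letter `letter_name`."""
    for src in sources:
        MAPPINGS[src] = letter_name

define_mapping("umlaut-u", ['\\"u', "\u00fc"])  # TeX spelling and ISO Latin char
define_mapping("umlaut-A", ['\\"A', "\u00c4"])

def to_letters(text):
    """Transform a string into a sequence of letter names (longest match first)."""
    sources = sorted(MAPPINGS, key=len, reverse=True)
    result = []
    i = 0
    while i < len(text):
        for src in sources:
            if text.startswith(src, i):
                result.append(MAPPINGS[src])
                i += len(src)
                break
        else:
            # No mapping applies: the character stands for itself.
            result.append(text[i])
            i += 1
    return result

print(to_letters('M\\"uller'))
print(to_letters("M\u00fcller"))
```

Both spellings end up as the same sequence of letters, so everything
after this step no longer depends on the input encoding.
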

> 5.
> > What I was just discussing with Gabor is the problem of markup (once
> > more). Often indexes contain commands such as "\index" (see for
> > example the LaTeX Companion) for which different index entries must be
> > specified for the command "\index" and the word "index" sorted as
> >
>
> This is a very ad hoc solution to what is probably an example of a
> more general class of sorting rules, in which words (ie the things to
> be indexed) have "types".

Indeed, it seems that this requires objects of type `word', composed
of a sequence of letters. This, too, is still an open question.

> This one could be done by a merge-rule that "ignores the \" and a
> sub-order that reinserts it (at the beginning or the end).
>
> Or it could be done by a letter property ":backslash-before"
> with values:  yes  no

It could, but in my opinion this scheme is not flexible and general
enough. I was actually looking for a more general scheme that allows
arbitrary specifications of this kind to be defined.
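
For reference, the ad hoc variant (ignore the backslash in the merge
step, reinsert it in a sub-order) could be sketched in Python as
follows. The two types "word" and "command", their ranking in
TYPE_RANK and the function names are assumptions for illustration
only; this is not the more general scheme I am looking for.

```python
# Assumed sub-order of the entry types: plain words before commands.
TYPE_RANK = {"word": 0, "command": 1}

def classify(entry):
    """Split an index entry into (text without leading backslash, type)."""
    if entry.startswith("\\"):
        return entry[1:], "command"
    return entry, "word"

def sort_key(entry):
    text, kind = classify(entry)
    # Merge step ignores the backslash; the type only breaks ties.
    return (text.lower(), TYPE_RANK[kind])

entries = ["\\index", "indent", "index", "\\item"]
print(sorted(entries, key=sort_key))
```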

> I hope this helps, there are certainly still a lot of things to
> discover and to discuss here, I suspect that the ISO document does not
> cover all the sorting requirements of complex documents.

That was one of my observations, too. The standard deals largely with
implementation details and may serve as a good solution for many
purposes, but in the document-processing domain it is not flexible
enough, and anyone who has tried to read such an ISO specification
table soon gets lost. I think viewing letters (and probably words) as
objects with properties is a better approach to specifying the sorting
problem formally.

Unfortunately, I'm currently short of time. Next week I have an exam
and I think I'll continue to work on this problem afterwards.

Thanks for your helpful comments.

CU

--
======================================================================
Roger Kehr			   kehr@iti.informatik.th-darmstadt.de
Computer Science Department          Technical University of Darmstadt

References:
- Re: Discussion about international sorting order
  - From: Chris Rowley <C.A.Rowley@open.ac.uk>
