XINDY: Proposal

3.1 A possible solution

The locale approach can be used to describe the sorting schemes of may languages. Though, using it in conjunction with xindy has several disadvantages.

We could incorporate the GNU libc implementation of locale into xindy. But defining a new set of rules would have a significant overhead. A new table must first be created and compiled with localedef into a form readable by the library. This is in contrast to the dynamic paradigm of xindy and therefore in my opinion not feasible.
We could extract parts of the library and reuse parts of the implementation. Actually I have not yet looked into the source but I'm not sure if this helps to solve the problem.
A largely complete specification in LC_COLLATE is rather complex and in no way declarative which initially was a design goal of xindy Additionally, it solves the problem of pure alphabets but dealing with markup in the keyword is still an open issue. The markup can often not be pre-processed since the markup may only play a rule in one of the later sorting runs. It actually depends on the user's needs.

Based on these observations my current proposal is a mixture of the locale approach and the current implementation in xindy.

My proposal is to extend the sorting rule mechanism to incorporate the idea of several successive runs obtaining partial orders until a total order is obtained.
The specifiaction of the command sort-rule can be extended with an additional argument :level taking the number of the level into which the sort rule is to be put. Additionally there must be a specification on how each run is to be sorted (forward, backward). These rules may still contain regular expression substitutions which may come into consideration at any level as necessary.

I'll give an example of how powerful this approach can be:

Assuming we have the following keywords to sort:

\tt{ARM}
\it{arm}
Arm
arm
Armbrust
armselig

Taking into consideration that we want to sort case-independent at the first level of comparision this can be done with the following rule set:

A         -> a
\tt{(.*)} -> \1  :again
\it{(.*)} -> \1  :again

This obtains the following result:

equivalence class (arm): Arm, arm, \it{arm}, \tt{ARM}
equivalence class (armbrust): Armbrust
equivalence class (armselig): armselig

The intended sorting rule says that the keywords containing markup should come before the others. Thus we must define a rule set expressing this sort order:

\tt{(.*)} -> 0\1  :again
\it{(.*)} -> 1\1  :again
A         -> 2a
a         -> 2a

Now we have prefixed the letters to obtain a further relative sorting order:

equivalence class (02arm): \tt{ARM}
equivalence class (12arm): \it{ARM}
equivalence class (2arm): Arm, arm

The last step is now to obtain a total order. We do not specify any other rules. since we sort according to the position in the ISO Latin alphabet with A being before a obtaining

equivalence class (Arm): Arm
equivalence class (arm): arm

Thus, we have gradually refined the partial order into an total one. The advantage is that we are still able to use declarative descriptions such as \tt{(.*)} -> \1 to match a many keywords at once.

3.2 Open issues

I have several questions about this scheme and I'm interested in your opinion.

Are there any deficiencies in this scheme. What are the cases it does not cover?
Is this understandable for Joe Anyuser?
Must it actually be completely understandable for Joe Anyuser? (see next question)
How could a simple user front-end look like that hides most of the details from the user? (We could use Lisp's macro-mechanism to preprocess some other form of description that expands to a bunch of these primitive mapping rules presented above.)
The principle of tokenising is not explicitly defined in this scheme. To make it real good we must operate entirely on tokens (see the 2arm stuff for example which is actually only a work-around to re-incorporate tokens in a very uncleanly manner). But is it necessary to include tokens?

3. Proposal

3.1 A possible solution

3.2 Open issues