khmer_syll_reorder is a character filter that replaces obsolete, deprecated, and variant Khmer characters and attempts to canonically reorder Khmer orthographic syllables in running text. It runs before tokenization because non-canonically ordered syllables can affect tokenization.
Note that this is not a full analyzer, only a character filter.
For an overview of the need for Khmer syllable reordering, see the blog post "Permuting Khmer"; for in-depth notes on the reordering algorithm and examples, see Trey's Notes on Khmer Reordering.
Briefly, the character filter removes zero-width elements,† removes duplicate elements, moves subscript Ro to be the last subscripted character, and reorders everything into the following order: base character + leftover register shifters + robat + subscript characters + dependent vowels + non-spacing diacritics + spacing diacritics.
† Zero-width elements include: zero width space (U+200B), zero width non-joiner (U+200C), zero width joiner (U+200D), soft hyphen (U+00AD), and invisible separator (U+2063).
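As a rough illustration of the ordering described above, the following Python sketch reorders a single, already-isolated syllable. It is not the plugin's implementation: the function name, the rank table, and the character sets are assumptions made for this example. It also skips duplicate removal and whatever special handling "leftover" register shifters imply, and it assumes the first character of the input is the base character.

```python
# Illustrative sketch only; names and character sets are assumptions,
# not the plugin's actual code.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u00ad", "\u2063"}
COENG = "\u17d2"                          # subscript marker
RO = "\u179a"                             # consonant Ro
REGISTER_SHIFTERS = {"\u17c9", "\u17ca"}  # MUUSIKATOAN, TRIISAP
ROBAT = "\u17cc"
SPACING_DIACRITICS = {"\u17c7", "\u17c8"}  # REAHMUK, YUUKALEAPINTU


def _rank(unit: str) -> int:
    """Return the sort position of one post-base unit."""
    if len(unit) == 2:                    # COENG + consonant (a subscript)
        return 3 if unit[1] == RO else 2  # subscript Ro goes last
    if unit in REGISTER_SHIFTERS:
        return 0
    if unit == ROBAT:
        return 1
    if "\u17b6" <= unit <= "\u17c5":      # dependent vowels
        return 4
    if unit in SPACING_DIACRITICS:
        return 6
    return 5  # non-spacing diacritics (and anything unrecognized)


def reorder_syllable(syllable: str) -> str:
    """Strip zero-width elements and sort the units that follow the base."""
    chars = [c for c in syllable if c not in ZERO_WIDTH]
    if not chars:
        return ""
    base, rest = chars[0], chars[1:]
    units = []
    i = 0
    while i < len(rest):
        if rest[i] == COENG and i + 1 < len(rest):
            units.append(rest[i] + rest[i + 1])  # keep COENG with its consonant
            i += 2
        else:
            units.append(rest[i])
            i += 1
    units.sort(key=_rank)  # stable sort preserves order within a category
    return base + "".join(units)
```

For example, passing a base consonant followed by a dependent vowel, a zero width space, and then a subscript Ro yields the base, the subscript Ro, and the vowel, in that order. Splitting running text into syllables, which the real filter must also do, is outside the scope of this sketch.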
The following characters are replaced or deleted using a custom Mapping Character Filter wrapped inside khmer_syll_reorder:
- The deprecated independent vowel ឣ (U+17A3) is replaced with អ (U+17A2).
- The deprecated independent vowel digraph ឤ (U+17A4) is replaced with the sequence អា (U+17A2 U+17B6).
- The obsolete ligature ឨ (U+17A8) is replaced with the sequence ឧក (U+17A7 U+1780).
- The independent vowel ឲ (U+17B2) is replaced, as a variant, with ឱ (U+17B1).
- The often invisible inherent vowels (឴) (U+17B4) and (឵) (U+17B5), which are usually only used for special transliteration applications, are deleted.
- The deprecated sign BATHAMASAT ៓ (U+17D3) is replaced with the sign NIKAHIT ំ (U+17C6).
- The deprecated trigram ៘ (U+17D8) is replaced with the sequence ។ល។ (U+17D4 U+179B U+17D4).
- The obsolete sign ATTHACAN ៝ (U+17DD) is replaced with VIRIAM ៑ (U+17D1).
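The replacements listed above amount to a small lookup table. The Python sketch below expresses them with `str.maketrans` and `str.translate`. The names `KHMER_REPLACEMENTS` and `replace_deprecated` are made up for this example, and the code only mirrors the list; it is not the plugin's mapping filter.

```python
# Illustrative sketch; the table mirrors the list above, the names are hypothetical.
KHMER_REPLACEMENTS = str.maketrans({
    "\u17a3": "\u17a2",              # deprecated independent vowel QAQ -> QA
    "\u17a4": "\u17a2\u17b6",        # deprecated digraph QAA -> QA + vowel AA
    "\u17a8": "\u17a7\u1780",        # obsolete ligature QUK -> QU + KA
    "\u17b2": "\u17b1",              # variant QOO TYPE TWO -> QOO TYPE ONE
    "\u17b4": None,                  # inherent vowel AQ: deleted
    "\u17b5": None,                  # inherent vowel AA: deleted
    "\u17d3": "\u17c6",              # BATHAMASAT -> NIKAHIT
    "\u17d8": "\u17d4\u179b\u17d4",  # BEYYAL -> KHAN + LO + KHAN
    "\u17dd": "\u17d1",              # ATTHACAN -> VIRIAM (U+17D1)
})


def replace_deprecated(text: str) -> str:
    """Apply the one-to-one, one-to-many, and deletion mappings in one pass."""
    return text.translate(KHMER_REPLACEMENTS)
```

A mapping value of `None` deletes the character, which is how the two inherent vowels are dropped here. Characters not in the table pass through unchanged.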
index :
  analysis :
    analyzer :
      khmer_text :
        type : custom
        char_filter : [khmer_syll_reorder]
        tokenizer : icu_tokenizer
        filter : [icu_normalizer]