khmer_syll_reorder

khmer_syll_reorder is a character filter that replaces obsolete, deprecated, and variant Khmer characters and attempts to canonically reorder Khmer orthographic syllables in running text. It runs before tokenization, since non-canonically ordered syllables can affect tokenization.

Note that this is not a full analyzer, only a character filter.

Syllable Reordering

For an overview of the need for Khmer syllable reordering, see the blog post "Permuting Khmer"; for in-depth notes on the reordering algorithm and examples, see Trey's Notes on Khmer Reordering.

Briefly, the character filter removes zero-width elements,† removes duplicate elements, moves subscript Ro to be the last subscripted character, and reorders everything into the following order: base character + leftover register shifters + robat + subscript characters + dependent vowels + non-spacing diacritics + spacing diacritics.

† Zero-width elements include: zero width space (U+200B), zero width non-joiner (U+200C), zero width joiner (U+200D), soft hyphen (U+00AD), and invisible separator (U+2063).
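As a rough illustration of the first step only, the zero-width removal can be sketched in Python. This is not the plugin's code: the names `ZERO_WIDTH` and `strip_zero_width` are hypothetical, and the sketch strips these code points everywhere in the string, which may be broader than what the filter actually does.

```python
# Hypothetical sketch: remove the zero-width elements named in the footnote.
# Assumption: they are stripped everywhere, not only inside Khmer syllables.
ZERO_WIDTH = {
    "\u200b",  # zero width space
    "\u200c",  # zero width non-joiner
    "\u200d",  # zero width joiner
    "\u00ad",  # soft hyphen
    "\u2063",  # invisible separator
}

# Mapping a character to None in a translation table deletes it.
_STRIP_TABLE = str.maketrans({ch: None for ch in ZERO_WIDTH})


def strip_zero_width(text: str) -> str:
    """Return text with all listed zero-width code points removed."""
    return text.translate(_STRIP_TABLE)
```

For example, `strip_zero_width("\u1780\u200b\u17d2\u200d\u1781")` drops the embedded U+200B and U+200D and leaves the three Khmer code points in place.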

Obsolete, Deprecated, and Variant Characters

The following characters are replaced or deleted using a custom Mapping Character Filter wrapped inside khmer_syll_reorder.

  • The deprecated independent vowel ឣ (U+17A3) is replaced with អ (U+17A2).

  • The deprecated independent vowel digraph ឤ (U+17A4) is replaced with the sequence អា (U+17A2 U+17B6).

  • The obsolete ligature ឨ (U+17A8) is replaced with the sequence ឧក (U+17A7 U+1780).

  • The independent vowel ឲ (U+17B2), a variant of ឱ (U+17B1), is replaced with ឱ (U+17B1).

  • The often invisible inherent vowels (឴) (U+17B4) and (឵) (U+17B5), which are usually only used for special transliteration applications, are deleted.

  • The deprecated sign BATHAMASAT ៓ (U+17D3) is replaced with the sign NIKAHIT ំ (U+17C6).

  • The deprecated trigram ៘ (U+17D8) is replaced with the sequence ។ល។ (U+17D4 U+179B U+17D4).

  • The obsolete sign ATTHACAN ៝ (U+17DD) is replaced with the sign VIRIAM ៑ (U+17D1).
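The replacements in this list can be approximated outside the plugin with a plain translation table. The following is a minimal Python sketch, not the wrapped Mapping Character Filter itself; `KHMER_CHAR_MAP` and `normalize_khmer_chars` are hypothetical names chosen for illustration.

```python
# Hypothetical sketch of the replacements and deletions listed above.
# Keys are single code points; values are replacement strings, or None to delete.
KHMER_CHAR_MAP = str.maketrans({
    "\u17a3": "\u17a2",              # deprecated independent vowel -> U+17A2
    "\u17a4": "\u17a2\u17b6",        # deprecated digraph -> U+17A2 U+17B6
    "\u17a8": "\u17a7\u1780",        # obsolete ligature -> U+17A7 U+1780
    "\u17b2": "\u17b1",              # variant independent vowel -> U+17B1
    "\u17b4": None,                  # inherent vowel, deleted
    "\u17b5": None,                  # inherent vowel, deleted
    "\u17d3": "\u17c6",              # BATHAMASAT -> NIKAHIT
    "\u17d8": "\u17d4\u179b\u17d4",  # deprecated trigram -> U+17D4 U+179B U+17D4
    "\u17dd": "\u17d1",              # ATTHACAN -> VIRIAM (U+17D1)
})


def normalize_khmer_chars(text: str) -> str:
    """Apply the character replacements and deletions to text."""
    return text.translate(KHMER_CHAR_MAP)
```

A single pass with `str.translate` is enough here because none of the replacement sequences contains a character that is itself a key in the table.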

Example

index :
    analysis :
        analyzer :
            khmer_text :
                type : custom
                char_filter: [khmer_syll_reorder]
                tokenizer : icu_tokenizer
                filter : [icu_normalizer]