Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse system locales in env_preferences #6158

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

Finchiedev
Copy link

As per the discussion in #6028, I have created a POSIX locale parser/converter, currently hidden in the private env_preferences::parse module. This is meant to change at some point while the PR is being drafted, especially once I add support for other platforms. Once more platforms are supported, I would also like to implement universal and platform-specific APIs, as per this comment by @zbraniecki.

My current thinking on code structure is to have some distinction between platform fetch and parse code (either using modules or files), but please let me know if all platform logic should just be kept in the same file.

Of course, feedback on the code itself would be very much appreciated!

As per the discussion in unicode-org#6028, I have implemented POSIX locale parsing
functionality, and a `try_convert_lossy()` function to attempt converting
to a `icu_locale::Locale`. This code is currently in the private `parse`
module, as a temporary solution while support for other platforms is
added and the code structure is finalized.
@CLAassistant
Copy link

CLAassistant commented Feb 19, 2025

CLA assistant check
All committers have signed the CLA.

Basic testing of each edge case, error case, along with end-to-end
tests using some common POSIX locales.
Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code looks great, @zbraniecki can you also take a look

@robertbastian robertbastian marked this pull request as ready for review February 19, 2025 11:07
@robertbastian robertbastian requested a review from a team as a code owner February 19, 2025 11:07
@robertbastian robertbastian marked this pull request as draft February 19, 2025 11:07
@robertbastian
Copy link
Member

(whoops wrong button)

Finchiedev added a commit to Finchiedev/icu4x that referenced this pull request Feb 19, 2025
@Finchiedev
Copy link
Author

Thanks very much for your time and feedback @robertbastian, I have implemented all the suggested changes and marked them as resolved.

Copy link
Member

@zbraniecki zbraniecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thank you for picking it up. I left comments. Feel free to extract them to new discussion topics if you think it's big enough, or we can continue in this PR for smaller back-and-forths.


let mut extensions = Extensions::new();
let mut script = None;
let mut variants = vec![variant!("posix")];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question:

  1. Why are you using Variant subtag instead of -u-va-posix?
  2. Why are you automatically setting posix variant for that Locale? I don't fully understand the value of maintenance of this bit of information during conversion. If I start with en_US in POSIX and I use your code to convert it to ICU4X Locale, why do I end up with en-US-posix and not en-US? What's the impact of that difference? Can we avoid it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using the variant subtag based on advice from @robertbastian:

Regarding -u-va-posix, this could be expressed as a Variant as well, i.e. de-posix. We actually have data in this format (en-US-posix in collator), and we don't have any data using -u-va-posix, so I think we should parse to a variant.

I have no opinion either way, so happy to do whatever is best. Same with removing the -posix variant altogether, it seems valuable to have Locales be consistent across platforms but I don't have enough experience in this area to say if that's the right call.

Might be worth noting that as a user of this library I am piping the output into fluent-langneg-rs, which I'd expect to drop the -posix variant anyways during language negotiation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. @robertbastian - do you have any guidance on the addition of posix variant value? At most I would expect that to be optional, but I can't come up with any use case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hsivonen the collator seems to be the only component that has data for a -posix variant, can you weigh in here?

language = language!("ssy")
}
// This keeps `variants` sorted; "-posix" comes before "-valencia"
"valencia" => variants.push(variant!("valencia")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why are you adding a variant here, and not in -u-va?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, the advice I received from @robertbastian:

Similar with @valencia, I've seen this as both a variant and a subdivision tag (-u-sd-valencia). We don't currently have data keyed by either.

I could have misinterpreted the meaning of "variant" in the text above, happy to change it to -u-va-valencia or -u-sd-valnecia if that would be more appropriate!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CLDR there are display names for valencia as a variant, so I think this is correct: https://github.com/unicode-org/cldr/blob/f7cb2b5ca09cdaf651912695f93903cc35cab69c/common/main/en.xml#L1304

Based on review comments by @zbraniecki, this moves the list of
known aliases into a new file `posix_aliases.rs`, and migrates
`try_convert_lossy` to the new `get_bcp47_subtags_from_posix_alias`
function. Also includes some style changes to use a more functional
style as per the review.

Review link:
unicode-org#6158 (review)
Renaming as POSIX is the only platform with significant locale parsing
logic - MacOS and Windows use BCP-47 identifiers natively (more or less).
The `cfg`s still reference `linux`, but in theory this code should
support any POSIX-compliant platform.
Changed to avoid ambiguity with `icu_locale::ParseError`.
…cale::ParseError`

Mostly the same, except `ConversionError` tracked offsets. While those
were nice to have, using `ParseError` will make cross-platform error
reporting much easier - `ConversionError` was POSIX-specific.
An MVP of the cross-platform locale parsing API, mostly using existing
code. There are still a lot of edge cases to be checked and documentation
to be added, but this will hopefully serve as a good base to do so.
@Finchiedev
Copy link
Author

Finchiedev commented Feb 25, 2025

I've pushed some commits that re-arrange the module structure, and created an MVP of what I'd expect env_preferences to look like API-wise. Very happy to bikeshed these changes, or remove them entirely if they'd be better suited for another PR (or shouldn't be merged at all).

I'd like to extend this API to use some kind of LocaleOptions so that users have a cross-platform method to select the category, for example LocaleOptions::Time would query LC_TIME on Linux and GlobalizationPreferences.Clocks on Windows.

@zbraniecki
Copy link
Member

I'm fine with that, but:

  1. Watch out for cross-OS. POSIX categories may not translate well to others OSes.
  2. options term in ICU4X refers to a bag of options used to customize behavior of a component. Such category is just one of those options.

You may solve it in several ways:
a) Have cross-OS behavior be "singular" locale list, but per-OS allow for retrieving per-category so that env_perferences::posix can have per-category
b) Have the categories be actually ICU4X components. "Get Locales for DateTimeFormat" is universal and may result in different behavior per-OS.

@Finchiedev
Copy link
Author

Thanks @zbraniecki, should have clarified I was leaning towards the solution b) you suggested, where the categories would correspond to ICU4X components. Also, will make sure to use options idiomatically, thanks for the heads up :)

I've already drafted a table of the differences between categories on different operating systems, and am in the process of testing using real-world setups - I expect that the MacOS section in particular is wrong and needs correcting. Once I'm confident in the categories I'll start investigating what can map back to ICU4X components.

Linux Windows MacOS
LC_ALL
LANG
LANGUAGE
Languages Preferred Languages
LC_ADDRESS
LC_COLLATE List sort order (?)
LC_CTYPE
LC_IDENTIFICATION
LC_MEASUREMENT Measurement system
LC_MESSAGES
LC_MONETARY Currencies Region
LC_NAME List sort order (?)
LC_NUMERIC Number format
Region
LC_PAPER
LC_TELEPHONE
LC_TIME Clocks Date format
Region
Calendars Calendar
HomeGeographicRegion
WeekStartsOn First day of week
Temperature

Table references:
https://man.archlinux.org/man/locale.7
https://support.apple.com/en-au/guide/mac-help/intl163/mac (https://stackoverflow.com/questions/45511458/get-user-preferred-temperature-setting-in-macos)
https://learn.microsoft.com/en-us/uwp/api/windows.system.userprofile.globalizationpreferences?view=winrt-26100#properties

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants