Parse system locales in `env_preferences` #6158

Finchiedev · 2025-02-19T09:59:06Z

As per the discussion in #6028, I have created a POSIX locale parser/converter, currently hidden in the private env_preferences::parse module. This is meant to change at some point while the PR is being drafted, especially once I add support for other platforms. Once more platforms are supported, I would also like to implement universal and platform-specific APIs, as per this comment by @zbraniecki.

My current thinking on code structure is to have some distinction between platform fetch and parse code (either using modules or files), but please let me know if all platform logic should just be kept in the same file.

Of course, feedback on the code itself would be very much appreciated!

As per the discussion in unicode-org#6028, I have implemented POSIX locale parsing functionality, and a `try_convert_lossy()` function to attempt converting to a `icu_locale::Locale`. This code is currently in the private `parse` module, as a temporary solution while support for other platforms is added and the code structure is finalized.
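For readers unfamiliar with the format, a POSIX locale identifier has the shape `language[_territory][.codeset][@modifier]`. The following is a minimal sketch of splitting such a string into its parts. It is not the code from this PR: `PosixParts` and `split_posix_locale` are hypothetical names used only for illustration, and the real `PosixLocale::try_from_str` may validate input differently.

```rust
/// Parts of a POSIX locale identifier: `language[_territory][.codeset][@modifier]`.
/// Hypothetical type for illustration; not the `PosixLocale` type from the PR.
#[derive(Debug, PartialEq, Eq)]
pub struct PosixParts<'a> {
    pub language: &'a str,
    pub territory: Option<&'a str>,
    pub codeset: Option<&'a str>,
    pub modifier: Option<&'a str>,
}

/// Splits a POSIX locale string into its parts.
/// Returns `None` if the language, or any section that has a delimiter, is empty.
pub fn split_posix_locale(input: &str) -> Option<PosixParts<'_>> {
    // The modifier comes last, so peel it off first.
    let (rest, modifier) = match input.split_once('@') {
        Some((rest, m)) => (rest, Some(m)),
        None => (input, None),
    };
    // Then the codeset, which follows the first '.'.
    let (rest, codeset) = match rest.split_once('.') {
        Some((rest, c)) => (rest, Some(c)),
        None => (rest, None),
    };
    // What remains is `language` or `language_territory`.
    let (language, territory) = match rest.split_once('_') {
        Some((l, t)) => (l, Some(t)),
        None => (rest, None),
    };

    // A delimiter followed by nothing (e.g. "en_") is treated as an error here.
    let non_empty = |s: Option<&str>| s.map_or(true, |s| !s.is_empty());
    if language.is_empty()
        || !non_empty(territory)
        || !non_empty(codeset)
        || !non_empty(modifier)
    {
        return None;
    }

    Some(PosixParts { language, territory, codeset, modifier })
}
```

Calling `split_posix_locale("ca_ES.UTF-8@valencia")` yields language `ca`, territory `ES`, codeset `UTF-8` and modifier `valencia`; converting those parts to an `icu_locale::Locale` is a separate, lossy step.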

CLAassistant · 2025-02-19T09:59:14Z

All committers have signed the CLA.


robertbastian

code looks great, @zbraniecki can you also take a look

utils/env_preferences/src/parse/posix.rs

robertbastian · 2025-02-19T11:07:53Z

(whoops wrong button)


Finchiedev · 2025-02-19T15:07:21Z

Thanks very much for your time and feedback @robertbastian, I have implemented all the suggested changes and marked them as resolved.

zbraniecki

This looks great! Thank you for picking it up. I left comments. Feel free to extract them to new discussion topics if you think it's big enough, or we can continue in this PR for smaller back-and-forths.

utils/env_preferences/src/parse/posix.rs

zbraniecki · 2025-02-21T22:50:24Z

utils/env_preferences/src/parse/posix.rs

+
+        let mut extensions = Extensions::new();
+        let mut script = None;
+        let mut variants = vec![variant!("posix")];


question:

Why are you using Variant subtag instead of -u-va-posix?

Why are you automatically setting posix variant for that Locale? I don't fully understand the value of maintenance of this bit of information during conversion. If I start with en_US in POSIX and I use your code to convert it to ICU4X Locale, why do I end up with en-US-posix and not en-US? What's the impact of that difference? Can we avoid it?

I am using the variant subtag based on advice from @robertbastian:

Regarding -u-va-posix, this could be expressed as a Variant as well, i.e. de-posix. We actually have data in this format (en-US-posix in collator), and we don't have any data using -u-va-posix, so I think we should parse to a variant.

I have no opinion either way, so happy to do whatever is best. Same with removing the -posix variant altogether, it seems valuable to have Locales be consistent across platforms but I don't have enough experience in this area to say if that's the right call.

Might be worth noting that as a user of this library I am piping the output into fluent-langneg-rs, which I'd expect to drop the -posix variant anyways during language negotiation.

Thanks. @robertbastian - do you have any guidance on the addition of posix variant value? At most I would expect that to be optional, but I can't come up with any use case.

@hsivonen the collator seems to be the only component that has data for a -posix variant, can you weigh in here?
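To make the options in this thread concrete, here is a small sketch of the three outcomes being discussed for a POSIX input such as `en_US`. It works on plain strings and assumes the base tag has no variants or extensions of its own; `PosixMarker` and `tag_with_marker` are hypothetical names, not part of `icu_locale` or this PR.

```rust
/// The three ways discussed above of recording that a locale came from POSIX.
/// Hypothetical type for illustration only.
pub enum PosixMarker {
    /// A variant subtag, e.g. `en-US-posix`.
    Variant,
    /// A Unicode extension keyword, e.g. `en-US-u-va-posix`.
    UnicodeExtension,
    /// No marker at all, e.g. `en-US`.
    Omitted,
}

/// Renders `base` (assumed to be a plain `language-REGION` tag without
/// variants or extensions) with the chosen marker appended.
pub fn tag_with_marker(base: &str, marker: PosixMarker) -> String {
    match marker {
        PosixMarker::Variant => format!("{base}-posix"),
        PosixMarker::UnicodeExtension => format!("{base}-u-va-posix"),
        PosixMarker::Omitted => base.to_string(),
    }
}
```

The resulting strings differ, so any data lookup or language negotiation that compares tags will treat them differently unless it normalizes or drops the marker.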

zbraniecki · 2025-02-21T22:50:44Z

utils/env_preferences/src/parse/posix.rs

+                    language = language!("ssy")
+                }
+                // This keeps `variants` sorted; "-posix" comes before "-valencia"
+                "valencia" => variants.push(variant!("valencia")),


question: why are you adding a variant here, and not in -u-va?

For context, the advice I received from @robertbastian:

Similar with @valencia, I've seen this as both a variant and a subdivision tag (-u-sd-valencia). We don't currently have data keyed by either.

I could have misinterpreted the meaning of "variant" in the text above, happy to change it to -u-va-valencia or -u-sd-valencia if that would be more appropriate!

In CLDR there are display names for valencia as a variant, so I think this is correct: https://github.com/unicode-org/cldr/blob/f7cb2b5ca09cdaf651912695f93903cc35cab69c/common/main/en.xml#L1304
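The diff comment above relies on `"posix"` sorting before `"valencia"` so that a plain `push` keeps the variant list ordered. A sketch of a more general approach, assuming variants are held as plain strings rather than `icu_locale` `Variant` values, is to insert at the position found by a binary search. `insert_variant` and `variants_for_modifier` are hypothetical helpers, not functions from the PR.

```rust
/// Inserts `variant` into `variants`, keeping the list sorted and free of duplicates.
/// `variants` must already be sorted.
pub fn insert_variant(variants: &mut Vec<String>, variant: &str) {
    // `binary_search_by` returns Err(position) when the value is absent;
    // inserting at that position preserves the ordering.
    if let Err(pos) = variants.binary_search_by(|v| v.as_str().cmp(variant)) {
        variants.insert(pos, variant.to_string());
    }
}

/// Builds the variant list for a locale with an optional POSIX `@modifier`.
/// Only the `valencia` modifier from the thread above is handled here.
pub fn variants_for_modifier(modifier: Option<&str>) -> Vec<String> {
    // Assumption: the `posix` variant is always present, as in the diff under review.
    let mut variants = vec!["posix".to_string()];
    if modifier == Some("valencia") {
        insert_variant(&mut variants, "valencia");
    }
    variants
}
```

With this helper the ordering no longer depends on which modifiers happen to sort after `posix`, at the cost of a search per insertion.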


Based on review comments by @zbraniecki, this moves the list of known aliases into a new file `posix_aliases.rs`, and migrates `try_convert_lossy` to the new `get_bcp47_subtags_from_posix_alias` function. Also includes some style changes to use a more functional style as per the review. Review link: unicode-org#6158 (review)


Renaming as POSIX is the only platform with significant locale parsing logic - MacOS and Windows use BCP-47 identifiers natively (more or less). The `cfg`s still reference `linux`, but in theory this code should support any POSIX-compliant platform.


…cale::ParseError` Mostly the same, except `ConversionError` tracked offsets. While those were nice to have, using `ParseError` will make cross-platform error reporting much easier - `ConversionError` was POSIX-specific.


Finchiedev · 2025-02-25T11:09:55Z

I've pushed some commits that re-arrange the module structure, and created an MVP of what I'd expect env_preferences to look like API-wise. Very happy to bikeshed these changes, or remove them entirely if they'd be better suited for another PR (or shouldn't be merged at all).

I'd like to extend this API to use some kind of LocaleOptions so that users have a cross-platform method to select the category, for example LocaleOptions::Time would query LC_TIME on Linux and GlobalizationPreferences.Clocks on Windows.
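As a sketch of what "query LC_TIME on Linux" involves: POSIX resolves a category by checking `LC_ALL` first, then the category's own variable, then `LANG`, skipping variables that are unset or empty. The functions below are hypothetical and not part of `env_preferences`; the lookup is passed in as a closure so the precedence can be exercised without touching the real environment. The GNU-specific `LANGUAGE` variable is deliberately left out.

```rust
use std::env;

/// Resolves a POSIX locale category (e.g. "LC_TIME") using the standard
/// precedence: `LC_ALL`, then the category variable, then `LANG`.
/// `lookup` returns the value of an environment variable, if it is set.
pub fn resolve_category_with<F>(category: &str, lookup: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    for name in ["LC_ALL", category, "LANG"] {
        if let Some(value) = lookup(name) {
            // An empty value is treated the same as an unset variable.
            if !value.is_empty() {
                return Some(value);
            }
        }
    }
    None
}

/// Convenience wrapper that reads from the process environment.
pub fn resolve_category(category: &str) -> Option<String> {
    resolve_category_with(category, |name| env::var(name).ok())
}
```

A cross-platform option such as the proposed `LocaleOptions::Time` could call something like `resolve_category("LC_TIME")` on POSIX systems and a platform API elsewhere.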


zbraniecki · 2025-02-25T19:30:24Z

I'm fine with that, but:

Watch out for cross-OS behavior. POSIX categories may not translate well to other OSes.
The term "options" in ICU4X refers to a bag of options used to customize the behavior of a component. Such a category is just one of those options.

You may solve it in several ways:
a) Have the cross-OS behavior be a "singular" locale list, but per-OS allow for retrieving per-category, so that env_preferences::posix can offer per-category retrieval
b) Have the categories be actually ICU4X components. "Get Locales for DateTimeFormat" is universal and may result in different behavior per-OS.

Finchiedev · 2025-02-26T04:50:16Z

Thanks @zbraniecki, should have clarified I was leaning towards the solution b) you suggested, where the categories would correspond to ICU4X components. Also, will make sure to use options idiomatically, thanks for the heads up :)

I've already drafted a table of the differences between categories on different operating systems, and am in the process of testing using real-world setups - I expect that the MacOS section in particular is wrong and needs correcting. Once I'm confident in the categories I'll start investigating what can map back to ICU4X components.

| Linux | Windows | MacOS |
| --- | --- | --- |
| LC_ALL LANG LANGUAGE | Languages | Preferred Languages |
| LC_ADDRESS | | |
| LC_COLLATE | | List sort order (?) |
| LC_CTYPE | | |
| LC_IDENTIFICATION | | |
| LC_MEASUREMENT | | Measurement system |
| LC_MESSAGES | | |
| LC_MONETARY | Currencies | Region |
| LC_NAME | | List sort order (?) |
| LC_NUMERIC | | Number format Region |
| LC_PAPER | | |
| LC_TELEPHONE | | |
| LC_TIME | Clocks | Date format Region |
| | Calendars | Calendar |
| | HomeGeographicRegion | |
| | WeekStartsOn | First day of week |
| | | Temperature |

Table references:
https://man.archlinux.org/man/locale.7
https://support.apple.com/en-au/guide/mac-help/intl163/mac (https://stackoverflow.com/questions/45511458/get-user-preferred-temperature-setting-in-macos)
https://learn.microsoft.com/en-us/uwp/api/windows.system.userprofile.globalizationpreferences?view=winrt-26100#properties
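A rough sketch of how solution b) could use the draft table: the caller names an ICU4X-style component, and each platform decides which setting backs it. Everything here is hypothetical, including the `Component` and `Sources` types and the choice of three components. The entries are copied from the table rows for LC_TIME, LC_COLLATE and LC_NUMERIC, so they inherit its uncertainty, especially in the MacOS column.

```rust
/// Hypothetical set of components a caller might request locales for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Component {
    DateTime,
    Collation,
    Number,
}

/// Where each platform would look for the setting, per the draft table.
/// `None` means the table has no entry for that platform.
#[derive(Debug, PartialEq, Eq)]
pub struct Sources {
    pub posix: &'static str,
    pub windows: Option<&'static str>,
    pub macos: Option<&'static str>,
}

/// Maps a component to its per-platform sources (unverified draft values).
pub fn sources(component: Component) -> Sources {
    match component {
        Component::DateTime => Sources {
            posix: "LC_TIME",
            windows: Some("Clocks"),
            macos: Some("Date format Region"),
        },
        Component::Collation => Sources {
            posix: "LC_COLLATE",
            windows: None,
            macos: Some("List sort order (?)"),
        },
        Component::Number => Sources {
            posix: "LC_NUMERIC",
            windows: None,
            macos: Some("Number format Region"),
        },
    }
}
```

Keeping the mapping in one function makes the gaps visible: wherever a platform has `None`, the implementation would need a fallback, most likely the general preferred-language list.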

Add unit tests for PosixLocale::{try_from_str, try_convert_lossy}

8dc28ea

Basic testing of each edge case, error case, along with end-to-end tests using some common POSIX locales.

robertbastian reviewed Feb 19, 2025


robertbastian requested a review from zbraniecki February 19, 2025 11:07

robertbastian marked this pull request as ready for review February 19, 2025 11:07

robertbastian requested a review from a team as a code owner February 19, 2025 11:07

robertbastian marked this pull request as draft February 19, 2025 11:07

Finchiedev added a commit to Finchiedev/icu4x that referenced this pull request Feb 19, 2025

Implement suggested changes to env_preferences::parse

12c8f4d

Adding fixes from @roberbastian as discussed in unicode-org#6158: unicode-org#6158 (review)

Implement suggested changes to env_preferences::parse

e912dfd

Adding fixes from @robertbastian as discussed in unicode-org#6158: unicode-org#6158 (review)

Finchiedev force-pushed the parse_locales branch from 12c8f4d to e912dfd Compare February 19, 2025 15:04

zbraniecki reviewed Feb 21, 2025


Finchiedev added 9 commits February 24, 2025 16:47

Add displaydoc to errors in env_preferences::parse

ed94bd4

As requested in this comment by @zbraniecki: unicode-org#6158 (comment)

Remove logging from PosixLocale::try_convert_lossy

ed232ba

As per review comment by @zbraniecki: unicode-org#6158 (comment)

Fix displaydoc crashing rust-analyzer in env_preferences::parse::posix

5b7e1e4

Multi-line displaydoc strings seem to break rust-analyzer, see rust-lang/rust-analyzer#10110

Move env_preferences::parse::posix::tests to separate file

9850129

Rename env_preferences::posix::parse::ParseError to PosixParseError

f9fa3cb

Changed to avoid ambiguity with `icu_locale::ParseError`.

Prototype cross-platform locale parsing API for env_preferences

18bed24

An MVP of the cross-platform locale parsing API, mostly using existing code. There are still a lot of edge cases to be checked and documentation to be added, but this will hopefully serve as a good base to do so.

Add known edge cases for env_preferences::WindowsLocale::try_from_str

d7b8093

Thanks to advice from @robertbastian: unicode-org#6028 (reply in thread)
