-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse system locales in env_preferences
#6158
base: main
Are you sure you want to change the base?
Conversation
As per the discussion in unicode-org#6028, I have implemented POSIX locale parsing functionality, and a `try_convert_lossy()` function to attempt converting to a `icu_locale::Locale`. This code is currently in the private `parse` module, as a temporary solution while support for other platforms is added and the code structure is finalized.
Basic testing of each edge case, error case, along with end-to-end tests using some common POSIX locales.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks great, @zbraniecki can you also take a look
(whoops wrong button) |
Adding fixes from @roberbastian as discussed in unicode-org#6158: unicode-org#6158 (review)
Adding fixes from @robertbastian as discussed in unicode-org#6158: unicode-org#6158 (review)
12c8f4d
to
e912dfd
Compare
Thanks very much for your time and feedback @robertbastian, I have implemented all the suggested changes and marked them as resolved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thank you for picking it up. I left comments. Feel free to extract them to new discussion topics if you think it's big enough, or we can continue in this PR for smaller back-and-forths.
|
||
let mut extensions = Extensions::new(); | ||
let mut script = None; | ||
let mut variants = vec![variant!("posix")]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question
:
- Why are you using Variant subtag instead of
-u-va-posix
? - Why are you automatically setting posix variant for that Locale? I don't fully understand the value of maintenance of this bit of information during conversion. If I start with
en_US
in POSIX and I use your code to convert it to ICU4X Locale, why do I end up withen-US-posix
and noten-US
? What's the impact of that difference? Can we avoid it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using the variant subtag based on advice from @robertbastian:
Regarding
-u-va-posix
, this could be expressed as aVariant
as well, i.e.de-posix
. We actually have data in this format (en-US-posix
in collator), and we don't have any data using-u-va-posix
, so I think we should parse to a variant.
I have no opinion either way, so happy to do whatever is best. Same with removing the -posix
variant altogether, it seems valuable to have Locale
s be consistent across platforms but I don't have enough experience in this area to say if that's the right call.
Might be worth noting that as a user of this library I am piping the output into fluent-langneg-rs, which I'd expect to drop the -posix
variant anyways during language negotiation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. @robertbastian - do you have any guidance on the addition of posix
variant value? At most I would expect that to be optional, but I can't come up with any use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hsivonen the collator seems to be the only component that has data for a -posix
variant, can you weigh in here?
language = language!("ssy") | ||
} | ||
// This keeps `variants` sorted; "-posix" comes before "-valencia" | ||
"valencia" => variants.push(variant!("valencia")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question
: why are you adding a variant here, and not in -u-va
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context, the advice I received from @robertbastian:
Similar with
@valencia
, I've seen this as both a variant and a subdivision tag (-u-sd-valencia
). We don't currently have data keyed by either.
I could have misinterpreted the meaning of "variant" in the text above, happy to change it to -u-va-valencia
or -u-sd-valnecia
if that would be more appropriate!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In CLDR there are display names for valencia
as a variant, so I think this is correct: https://github.com/unicode-org/cldr/blob/f7cb2b5ca09cdaf651912695f93903cc35cab69c/common/main/en.xml#L1304
Based on review comments by @zbraniecki, this moves the list of known aliases into a new file `posix_aliases.rs`, and migrates `try_convert_lossy` to the new `get_bcp47_subtags_from_posix_alias` function. Also includes some style changes to use a more functional style as per the review. Review link: unicode-org#6158 (review)
As requested in this comment by @zbraniecki: unicode-org#6158 (comment)
As per review comment by @zbraniecki: unicode-org#6158 (comment)
Multi-line displaydoc strings seem to break rust-analyzer, see rust-lang/rust-analyzer#10110
Renaming as POSIX is the only platform with significant locale parsing logic - MacOS and Windows use BCP-47 identifiers natively (more or less). The `cfg`s still reference `linux`, but in theory this code should support any POSIX-compliant platform.
Changed to avoid ambiguity with `icu_locale::ParseError`.
…cale::ParseError` Mostly the same, except `ConversionError` tracked offsets. While those were nice to have, using `ParseError` will make cross-platform error reporting much easier - `ConversionError` was POSIX-specific.
An MVP of the cross-platform locale parsing API, mostly using existing code. There are still a lot of edge cases to be checked and documentation to be added, but this will hopefully serve as a good base to do so.
I've pushed some commits that re-arrange the module structure, and created an MVP of what I'd expect I'd like to extend this API to use some kind of |
I'm fine with that, but:
You may solve it in several ways: |
Thanks @zbraniecki, should have clarified I was leaning towards the solution b) you suggested, where the categories would correspond to ICU4X components. Also, will make sure to use I've already drafted a table of the differences between categories on different operating systems, and am in the process of testing using real-world setups - I expect that the MacOS section in particular is wrong and needs correcting. Once I'm confident in the categories I'll start investigating what can map back to ICU4X components.
Table references: |
As per the discussion in #6028, I have created a POSIX locale parser/converter, currently hidden in the private
env_preferences::parse
module. This is meant to change at some point while the PR is being drafted, especially once I add support for other platforms. Once more platforms are supported, I would also like to implement universal and platform-specific APIs, as per this comment by @zbraniecki.My current thinking on code structure is to have some distinction between platform
fetch
andparse
code (either using modules or files), but please let me know if all platform logic should just be kept in the same file.Of course, feedback on the code itself would be very much appreciated!