Motivation
Unicode is pretty complex. In a recent talk, I was surprised that many people seemed to believe that Unicode only specifies a set of characters and their associated encodings such as UTF-16 or UTF-8. No, its complexity extends into the definition of algorithms for internationalization and localization, for example time and date formatting or string collation. ICU even made an effort to provide a line layout engine, but soon deprecated it in favor of HarfBuzz. After icu4j (the ICU library for Java) and icu4c (the ICU library for C), the newest implementation, icu4x for Rust, saw its 1.0 release last September.
Let us have a look at the features of the current 1.1.0 release.
Resources
Calendar
- Published as a separate crate as well: icu_calendar
- The crate has a size of only 60 kB, but with its 8 dependencies (cargo tree | wc -l shows 55 lines) compilation can take a while.
- The crate is mature. It provides an abstraction over arbitrary calendars, wraps each calendar in a module, has a concept of data providers to allow optimizations for compile-time data, wraps basic types in custom types to provide auxiliary methods, has a comprehensible list of errors, and declares relevant enums as non-exhaustive for extensibility.
- On the other hand, I think the design of data providers is awkward and insufficiently documented. More on that later.
-
The following calendars are supported:
- iso
- gregorian
- japanese
- julian
- coptic
- indian
- buddhist
- ethiopian
The Gregorian calendar differs from the Julian calendar in its leap year rules. The Julian calendar assumes a duration of 365.25 days per year, which leads to a stronger divergence from reality than the Gregorian rules. The Gregorian calendar technically includes lunar information for the computation of Easter and the like, unlike the ISO calendar. Furthermore, the “week of the year” computation sometimes differs between the Gregorian and the ISO calendar.
use icu_calendar::{types::IsoWeekday, Date};

fn main() {
    let date_iso = Date::try_new_iso_date(1977, 5, 13)
        .expect("Failed to initialize ISO Date instance.");

    // use the methods {year, month, day_of_month} to access the metadata
    assert_eq!(date_iso.year().number, 1977);
    assert_eq!(date_iso.month().ordinal, 5);
    assert_eq!(date_iso.day_of_month().0, 13);

    // compute data about this timestamp
    assert_eq!(date_iso.day_of_week(), IsoWeekday::Friday);
    assert_eq!(date_iso.days_in_year(), 365);
    assert_eq!(date_iso.days_in_month(), 31);
}
You can also switch to another calendar:
use icu_calendar::indian::Indian;

// Conversion into the Indian calendar: 1899-02-23.
let date_indian = date_iso.to_calendar(Indian);
assert_eq!(date_indian.year().number, 1899);
assert_eq!(date_indian.month().ordinal, 2);
assert_eq!(date_indian.day_of_month().0, 23);
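The leap year difference mentioned above can be verified with the same conversion mechanism. Here is a minimal sketch, assuming the Julian calendar type lives at icu_calendar::julian::Julian, analogous to the Indian one. The year 1900 is a leap year under the Julian rule (divisible by 4), but not under the Gregorian one (divisible by 100, but not by 400):

use icu_calendar::{julian::Julian, Date};

fn main() {
    // 1900 is not a Gregorian/ISO leap year
    let date_iso = Date::try_new_iso_date(1900, 6, 1)
        .expect("Failed to initialize ISO Date instance.");
    assert_eq!(date_iso.days_in_year(), 365);

    // but it is a Julian leap year
    let date_julian = date_iso.to_calendar(Julian);
    assert_eq!(date_julian.year().number, 1900);
    assert_eq!(date_julian.days_in_year(), 366);
}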
To represent dates as strings, we have to continue with the datetime component. Thus, we switch to the icu_datetime crate. Both crates are also part of the icu crate, so you can either run cargo add icu_datetime or cargo add icu. I decided to provide the following snippets with the icu crate.
datetime
When we try to find the string representation of a datetime, we need to talk about DataProviders.
To my understanding, a data provider supplies information about the structure of an internationalized representation. For example, the locale definition en-u-ca-gregory carries language and calendar information. Given such a locale and a datetime value, the corresponding string representation should be retrievable. DataProviders seem to store this information; as such, they store arbitrary, untyped data. In the best case, the data comes from CLDR (the Unicode Common Locale Data Repository). In the worst case, the data comes from icu_testdata. Don’t look into its implementation or API, just follow the tutorial. Let us do that.
cargo install icu_datagen
icu4x-datagen --keys all --locales full --include-collations 'search*' --cldr-tag 'latest' --format blob --out internationalization_blob.postcard
Note that my blob is more complete than the one given in the tutorial (additional collations and CLDR tags). The resulting file internationalization_blob.postcard has a size of 13 MB here. Now let us try to use it:
cargo add icu --features serde
cargo add icu_provider_blob
If you know the calendar at compile time, you can pick the TypedDateTimeFormatter<C> type where C is a calendar, e.g. Iso. On the other hand, DateTimeFormatter only has to know the calendar at run time. We will use the latter:
use std::fs;
use icu::locid::{locale, Locale};
use icu::calendar::DateTime;
use icu::datetime::{DateTimeFormatter, options::length};
use icu_provider_blob::BlobDataProvider;

// locale to use
const LOCALE: Locale = locale!("ja");

fn main() {
    // configuration
    let options = length::Bag::from_date_time_style(length::Date::Long, length::Time::Medium);
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");

    // timestamp to use
    let timestamp = DateTime::try_new_iso_datetime(1977, 5, 13, 15, 43, 26)
        .expect("Failed to initialize ISO datetime");

    // formatter instance
    let dtf = DateTimeFormatter::try_new_with_buffer_provider(&provider, &LOCALE.into(), options.into())
        .expect("Failed to initialize DateTimeFormatter");

    println!("{}", dtf.format(&timestamp.to_any()).expect("Formatting should succeed"));
    // prints "1977年5月13日 15:43:26"
}
You can follow the tutorial to reduce the size of the blob. The locale specifies data such as language, region, and script.
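To illustrate which pieces of information end up in a locale, here is a small sketch using icu::locid, which needs no data provider at all. The subtag accessors are my reading of the icu_locid API:

use icu::locid::Locale;

fn main() {
    // language "zh", script "Hans", region "SG", plus a calendar extension
    let loc: Locale = "zh-Hans-SG-u-ca-buddhist".parse().expect("valid BCP-47 string");
    assert_eq!(loc.id.language.as_str(), "zh");
    assert_eq!(loc.id.script.unwrap().as_str(), "Hans");
    assert_eq!(loc.id.region.unwrap().as_str(), "SG");

    // the -u- extension (here: the calendar system) is what DateTimeFormatter inspects
    println!("{}", loc); // "zh-Hans-SG-u-ca-buddhist"
}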
timezone
Eventually, you also have to deal with timezones. In ICU4X, a formattable time zone consists of four different fields:
- The GMT offset: the difference to GMT, stored in seconds
- The time zone ID: ICU4X uses BCP-47 time zone IDs like “uschi” (unlike IANA time zone IDs like “America/Chicago”)
- The metazone ID: several time zone IDs map to the same metazone ID, depending on a timestamp
- The zone variant: either “dt” (daylight or summer time) or “st” (standard or winter time)
use std::fs;
use icu::calendar::DateTime;
use icu::datetime::time_zone::{TimeZoneFormatter, TimeZoneFormatterOptions};
use icu::locid::locale;
use icu::timezone::provider::{MetazoneId, TimeZoneBcp47Id};
use icu::timezone::{CustomTimeZone, GmtOffset, MetazoneCalculator};
use icu_provider_blob::BlobDataProvider;
use tinystr::tinystr;

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");

    let mzc = MetazoneCalculator::try_new_with_buffer_provider(&provider).unwrap();
    let tzf = TimeZoneFormatter::try_new_with_buffer_provider(
        &provider, &locale!("en").into(), TimeZoneFormatterOptions::default()
    ).unwrap();

    // timezone "gugum" corresponds to metazone "guam"
    let ref_date = DateTime::try_new_iso_datetime(1977, 5, 13, 15, 43, 26)
        .expect("Failed to initialize ISO Date instance.");
    let timezone_id = TimeZoneBcp47Id(tinystr!(8, "gugum"));
    let metazone = mzc.compute_metazone_from_time_zone(timezone_id, &ref_date);
    let expected_metazone_id = MetazoneId(tinystr!(4, "guam"));
    assert_eq!(metazone, Some(expected_metazone_id));

    // parsing and default formatting
    let timezone = "+0530".parse::<CustomTimeZone>().unwrap();
    assert_eq!(tzf.format_to_string(&timezone), "GMT+05:30");

    // more sophisticated parsing
    let tz0: CustomTimeZone = "Z".parse().expect("Failed to parse a time zone.");
    let tz1: CustomTimeZone = "+02".parse().expect("Failed to parse a time zone.");
    let tz2: CustomTimeZone = "-0230".parse().expect("Failed to parse a time zone.");
    let tz3: CustomTimeZone = "+02:30".parse().expect("Failed to parse a time zone.");
    assert_eq!(tz0.gmt_offset.map(GmtOffset::offset_seconds), Some(0));
    assert_eq!(tz1.gmt_offset.map(GmtOffset::offset_seconds), Some(7200));
    assert_eq!(tz2.gmt_offset.map(GmtOffset::offset_seconds), Some(-9000));
    assert_eq!(tz3.gmt_offset.map(GmtOffset::offset_seconds), Some(9000));
}
At this point, I dislike the fact that one has to use tinystr and cannot simply use &str.
decimal
Here, I discovered a difference between a DataLocale and a Locale. “DataLocale contains less functionality than Locale but more than LanguageIdentifier for better size and performance while still meeting the needs of the ICU4X data pipeline”, says the documentation. Furthermore, I had to cargo add icu_provider, because icu_provider does not seem to be importable through icu::provider.
use std::fs;
use icu::locid::{locale, Locale};
use icu_provider::DataLocale;
use icu_provider_blob::BlobDataProvider;
use fixed_decimal::FixedDecimal;
use icu::decimal::FixedDecimalFormatter;

// locale to use
const LOCALE: Locale = locale!("en");

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");
    let fdf = FixedDecimalFormatter::try_new_with_buffer_provider(
        &provider, &DataLocale::from(LOCALE), Default::default()
    ).expect("Data should load successfully");

    let fixed_decimal = FixedDecimal::from(1000007);
    println!("{}", fdf.format_to_string(&fixed_decimal));
    // prints "১০,০০,০০৭" for locale "bn"
    // prints "1 000 007" for locale "sv"
    // prints "1.000.007" for locale "de"
    // prints "1,000,007" for locale "en"
}
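FixedDecimal is not limited to integers. The following sketch assumes that the multiplied_pow10 method of the fixed_decimal crate shifts the decimal point as I expect:

use std::fs;
use fixed_decimal::FixedDecimal;
use icu::decimal::FixedDecimalFormatter;
use icu::locid::locale;
use icu_provider_blob::BlobDataProvider;

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");
    let fdf = FixedDecimalFormatter::try_new_with_buffer_provider(
        &provider, &locale!("en").into(), Default::default()
    ).expect("Data should load successfully");

    // shift 1234567 by two decimal places: 12345.67
    let fixed_decimal = FixedDecimal::from(1234567).multiplied_pow10(-2);
    println!("{}", fdf.format_to_string(&fixed_decimal));
    // should print "12,345.67" for locale "en"
}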
Support for currencies, measurement units, and compact notation is planned. To track progress, follow issue #275.
case folding
The corresponding crate for case folding is declared experimental; its API might change anytime. And I could not make it run: I could not find a DataProvider providing CaseMappingV1Marker. Unlike other structs, CaseMapping does not provide an appropriate constructor for my BlobDataProvider. Recognize that I don’t want to use icu_testdata, which might satisfy it. But since no example is provided in the API reference, I don’t know any working solution.
plural
use std::fs;
use icu::locid::{locale, Locale};
use icu_provider_blob::BlobDataProvider;
use icu::plurals::{PluralCategory, PluralRules};

// locale to use
const LOCALE: Locale = locale!("en");

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice()).unwrap();
    let pr = PluralRules::try_new_ordinal_with_buffer_provider(&provider, &LOCALE.into()).unwrap();

    for i in 0..6u64 {
        let fallback = format!("{}th", i);
        println!("{} apple",
            match pr.category_for(i) {
                PluralCategory::Zero => panic!("impossible for locale 'en'"),
                PluralCategory::One => "1st",  // {1}
                PluralCategory::Two => "2nd",  // {2}
                PluralCategory::Few => "3rd",  // {3}
                PluralCategory::Many => panic!("impossible for locale 'en'"),
                PluralCategory::Other => &fallback, // {0, 4, 5}
            }
        );
    }
}
Creating the plural of a word is linguistically difficult, and we cannot create the plural of a word with this API, so its usefulness is limited. But it is a step in the right direction: it gives us two layers which provide the necessary distinctions.
First, one needs to understand the layer of PluralRuleType:
- Cardinal: 3 doors, 1 month, 10 dollars
- Ordinal: 1st place, 10th day, 11th floor
This is somewhat intuitive. Speaking of English: is it an ordinal? If so, one needs to distinguish between the suffixes “st”, “nd”, “rd”, and “th” for integers. Is it a cardinal? If so, the inflection influences the word itself: no suffix for quantity 1 and the suffix “s” for other quantities. Compare the words “table” and “tables”.
Second, we need to understand the layer of PluralCategory. The documentation has some interesting notes with examples for cardinals:
- Zero: Arabic (ar) and Latvian (lv) have an inflection for zero quantities. Latvian also uses it for multiples of 10.
- One: The singular occurs in every language, but Filipino (fil), for example, uses it for {2, 3, 5, 7, 8, …} as well.
- Two: A form used for 2 in Arabic (ar), Hebrew (iw), and Slovenian (sl).
- Few: A form used for 0 in Romanian (ro), for 1.2 in Croatian (hr), Romanian (ro), Slovenian (sl), and Serbian (sr), and for 5 in Arabic (ar), Lithuanian (lt), and Romanian (ro).
- Many: A form used for 1.0 and 1.1 in Czech (cs) and Slovak (sk), as well as for 15 in Russian (ru) and Ukrainian (uk).
- Other: A catch-all form. The only variant used in Japanese, Chinese, Korean, and Thai, since these languages don’t use plural forms.
A third layer, which would inflect the word within a plural category, is missing. It is also linguistically difficult (except for Esperanto 😅). But as a result, we can implement the following process for locale “en” and the word “table” with this API (a sketch follows the list):
- Is it a cardinal?
  - Is it in plural category One? Then add no suffix to the noun.
  - Is it in plural category Other? Then add the suffix “s” to the noun.
- Is it an ordinal?
  - Is it in plural category One? Then add the suffix “st” to the integer.
  - Is it in plural category Two? Then add the suffix “nd” to the integer.
  - Is it in plural category Few? Then add the suffix “rd” to the integer.
  - Is it in plural category Other? Then add the suffix “th” to the integer.
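As a proof of concept, here is a minimal sketch of that process on top of the two layers. The helper functions and the hard-coded English suffixes are mine, not part of the API; only the category lookup comes from icu::plurals:

use std::fs;
use icu::locid::locale;
use icu::plurals::{PluralCategory, PluralRules};
use icu_provider_blob::BlobDataProvider;

// hypothetical helper: inflect a noun for a cardinal quantity (locale "en")
fn cardinal_en(rules: &PluralRules, noun: &str, n: usize) -> String {
    match rules.category_for(n) {
        PluralCategory::One => format!("{} {}", n, noun),
        _ => format!("{} {}s", n, noun), // category Other for "en"
    }
}

// hypothetical helper: attach the ordinal suffix to an integer (locale "en")
fn ordinal_en(rules: &PluralRules, n: usize) -> String {
    match rules.category_for(n) {
        PluralCategory::One => format!("{}st", n),
        PluralCategory::Two => format!("{}nd", n),
        PluralCategory::Few => format!("{}rd", n),
        _ => format!("{}th", n), // category Other for "en"
    }
}

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice()).unwrap();
    let cardinal = PluralRules::try_new_cardinal_with_buffer_provider(&provider, &locale!("en").into()).unwrap();
    let ordinal = PluralRules::try_new_ordinal_with_buffer_provider(&provider, &locale!("en").into()).unwrap();

    assert_eq!(cardinal_en(&cardinal, "table", 1), "1 table");
    assert_eq!(cardinal_en(&cardinal, "table", 4), "4 tables");
    assert_eq!(ordinal_en(&ordinal, 2), "2nd");
    assert_eq!(ordinal_en(&ordinal, 11), "11th"); // 11 falls into Other, not One
}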
collation
use std::fs;
use core::cmp::Ordering;
use icu_provider_blob::BlobDataProvider;
use icu::collator::*;
use icu::locid::{locale, Locale};

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");

    let mut words = ["pollo", "polvo"];

    let locale_es: Locale = locale!("es-u-co-trad");
    let mut options = CollatorOptions::new();
    options.strength = Some(Strength::Primary);
    let collator_es: Collator = Collator::try_new_with_buffer_provider(
        &provider, &locale_es.into(), options
    ).unwrap();

    // NOTE: "pollo" > "polvo" in traditional Spanish
    words.sort();
    println!("words = {:?}", &words); // ["pollo", "polvo"] in pure rust without locale support
    words.sort_by(|a, b| collator_es.compare(a, b));
    println!("words = {:?}", &words); // ["polvo", "pollo"] in ICU4X with locale support
}
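The Strength option deserves a note: with Strength::Primary, only base letters are compared, so case and diacritics are ignored. A minimal sketch with locale “en”, reflecting my expectation from the UCA strength levels:

use std::fs;
use core::cmp::Ordering;
use icu::collator::*;
use icu::locid::locale;
use icu_provider_blob::BlobDataProvider;

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");

    let mut options = CollatorOptions::new();
    options.strength = Some(Strength::Primary);
    let collator: Collator = Collator::try_new_with_buffer_provider(
        &provider, &locale!("en").into(), options
    ).unwrap();

    // primary strength: case and diacritics do not participate in the comparison
    assert_eq!(collator.compare("MÜNCHEN", "munchen"), Ordering::Equal);
}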
list
use std::fs;
use icu_provider_blob::BlobDataProvider;
use icu::locid::locale;
use icu::list::ListFormatter;
use icu::list::ListLength;

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");
    let list_formatter = ListFormatter::try_new_and_with_length_with_buffer_provider(
        &provider,
        &locale!("es").into(),
        ListLength::Wide,
    )
    .expect("Data should load successfully");

    println!("{}", list_formatter.format_to_string(["España", "Suiza"].iter()));
    // prints "España y Suiza"
    println!("{}", list_formatter.format_to_string(["España", "Suiza", "Italia"].iter()));
    // The Spanish 'y' sometimes becomes an 'e':
    // prints "España, Suiza e Italia"
}
In this case, the output depends on the locale and the configured ListLength, which is one of Narrow, Short, and Wide. These are explained in TR35. For example, “Jan., Feb., Mar.” is narrow, unlike “Jan., Feb., and Mar.”, which is short, because the conjunction is expressed explicitly.
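To see the three lengths side by side, one can instantiate a formatter per length. The following sketch reuses the blob provider; the expectation in the final comment is taken from the TR35 example above:

use std::fs;
use icu::list::{ListFormatter, ListLength};
use icu::locid::locale;
use icu_provider_blob::BlobDataProvider;

fn main() {
    let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
    let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
        .expect("Failed to initialize Data Provider.");

    for length in [ListLength::Wide, ListLength::Short, ListLength::Narrow] {
        let lf = ListFormatter::try_new_and_with_length_with_buffer_provider(
            &provider, &locale!("en").into(), length
        ).expect("Data should load successfully");
        println!("{}", lf.format_to_string(["Jan.", "Feb.", "Mar."].iter()));
    }
    // per TR35, the narrow form should drop the conjunction: "Jan., Feb., Mar."
}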
properties and categories
Unfortunately, this crate does not work with BlobDataProvider either. You are forced to use icu_testdata. So I give up and actually use it.
The interesting idea about this API is that you can access sets of data. Specifically, the Unicode-defined sets can be read through the available load functions. These sets (type CodePointSetData) exist for binary properties and certain enumerated properties, as the API explains.
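For instance, the binary Alphabetic property can be loaded as such a set. A small sketch, assuming sets::load_alphabetic is one of the mentioned load functions:

use icu::properties::sets;

fn main() {
    let data = sets::load_alphabetic(&icu_testdata::unstable())
        .expect("The data should be valid");
    let alphabetic = data.as_borrowed();

    assert!(alphabetic.contains('A'));
    assert!(!alphabetic.contains('7')); // digits are not Alphabetic
}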
On the other side, APIs that return a CodePointMapData exist for certain enumerated properties. Specifically, the default example shows that you can look up the Script property of Unicode scalars:
use icu::properties::{maps, Script};

fn main() {
    let map = maps::load_script(&icu_testdata::unstable())
        .expect("The data should be valid");
    let script = map.as_borrowed();

    assert_eq!(script.get('🎃'), Script::Common); // U+1F383 JACK-O-LANTERN
    assert_eq!(script.get('木'), Script::Han);    // U+6728
}
Unicode normalization
use icu::normalizer;

fn main() -> Result<(), normalizer::NormalizerError> {
    {
        let normalizer = normalizer::ComposingNormalizer::try_new_nfc_with_any_provider(&icu_testdata::any())?;
        // X := U+0043 LATIN CAPITAL LETTER C
        // Y := U+0327 COMBINING CEDILLA
        // Z := U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA
        // {X, Y} is canonically equivalent to {Z}
        let input_text = "\u{0043}\u{0327}";
        let expected_text = "\u{00C7}";
        let normalized_text = normalizer.normalize(input_text);
        assert_eq!(normalized_text, expected_text);
    }
    {
        let normalizer = normalizer::ComposingNormalizer::try_new_nfkc_with_any_provider(&icu_testdata::any())?;
        // R := U+2460 CIRCLED DIGIT ONE
        // S := U+0031 DIGIT ONE
        // {R} is compatibility-equivalent to {S}
        let input_text = "\u{2460}";
        let expected_text = "\u{0031}";
        let normalized_text = normalizer.normalize(input_text);
        assert_eq!(normalized_text, expected_text);
    }
    Ok(())
}
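Decomposition works analogously. A sketch, assuming DecomposingNormalizer offers constructors mirroring the composing ones:

use icu::normalizer;

fn main() -> Result<(), normalizer::NormalizerError> {
    let nfd = normalizer::DecomposingNormalizer::try_new_nfd_with_any_provider(&icu_testdata::any())?;
    // NFD splits U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA back into
    // U+0043 LATIN CAPITAL LETTER C followed by U+0327 COMBINING CEDILLA
    assert_eq!(nfd.normalize("\u{00C7}"), "\u{0043}\u{0327}");
    Ok(())
}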
Unicode segmentation
The entire segmentation API is experimental, so you need to run cargo add icu_testdata --features icu_segmenter and cargo add icu --features icu_segmenter to enable this module.
Once more, we have a dependency on icu_testdata. Since I am currently reading Unicode TR #14, my guess is that the character classes of Unicode scalars are used rather than hardcoded (“the algorithm defined in Section 6, Line Breaking Algorithm also makes use of East_Asian_Width property values, defined in Unicode Standard Annex #11, East Asian Width [UAX11]”).
use icu::segmenter::LineSegmenter;
use icu::segmenter::WordSegmenter;
use icu::segmenter::SentenceSegmenter;
use icu::segmenter::GraphemeClusterSegmenter;

fn main() {
    let line_seg = LineSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
    let breakpoints: Vec<usize> = line_seg.segment_str("Hello World").collect();
    assert_eq!(&breakpoints, &[6, 11]);

    let word_seg = WordSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
    let breakpoints: Vec<usize> = word_seg.segment_latin1(b"Hello World").collect();
    assert_eq!(&breakpoints, &[0, 5, 6, 11]);

    let sentence_seg = SentenceSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
    let breakpoints: Vec<usize> = sentence_seg.segment_latin1(b"Hello World").collect();
    assert_eq!(&breakpoints, &[0, 11]);

    let graph_seg = GraphemeClusterSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
    let breakpoints: Vec<usize> = graph_seg.segment_str("देवनागरी 🗺").collect();
    assert_eq!(&breakpoints, &[0, 6, 9, 15, 18, 24, 25, 29]);
}
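The returned breakpoints are byte offsets into the UTF-8 string (respectively indices into the Latin-1 buffer), so adjacent pairs delimit the segments. A small sketch for extracting the word segments:

use icu::segmenter::WordSegmenter;

fn main() {
    let text = "Hello World";
    let word_seg = WordSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
    let breakpoints: Vec<usize> = word_seg.segment_str(text).collect();

    // each pair of consecutive breakpoints spans one segment
    let segments: Vec<&str> = breakpoints
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect();
    assert_eq!(segments, ["Hello", " ", "World"]);
}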
Compatibility
I wondered for which targets I can build the library. In general, icu4x implements all the important functionality in Rust, but there seems to be some FFI code in C++ used for fallback lookups. I tried to compile the last example for all “Tier 1 with Host Tools” targets:
| target | compilation result |
|---|---|
| | linking error: |
| | success |
| | success until the linking step, |
| | success until the linking step, |
| | success until the linking step, |
| | success |
| | success until the linking step, |
| | success |
| | success |
| | compilation failure: |
Okay, I admit: the final two rows are from Tier 2 targets, but I am personally interested in WebAssembly support.
Conclusion
- There are several confusing abstractions where one gives up and just copies code from working examples. This includes DataProvider and DataLocale versus Locale. Certainly there was some initial motivation, but it is insufficiently documented and does not feel consistent with other concepts. DataProvider is super-abstract and seems to circumvent Rust’s type system (with Any types), contrary to Locale. DataLocale seems to be a more specialized version of Locale; this follows the usual Rust conventions with a from constructor.
- Locale implements Into<DataLocale> to allow conversion of a Locale instance into a DataLocale. I think this is neat, but as a programmer you really lose track of which kind of information is part of the locale. I think this restricts the applicability of the library a lot (“how do I convey to users what information they need to provide in the locale string?”).
- Some modules can be imported with icu_calendar as well as icu::calendar, since they are provided as separate crates too. But not all of them (icu_provider).
- If you actually want to provide your own localization data, implementing the traits does not seem too bad. icu_testdata is very readable for Rust programmers.
In the end, I would conclude that you need an experienced Rust programmer and a motivated programmer to use this library: the former, because someone needs to properly interpret errors with trait dependencies, and the latter, because someone always needs to follow the up-to-date news of the library. Most of the features are hidden behind unstable dependencies (feature flags, unstable keywords in constructors, experimental APIs, binary blobs which might not provide the latest data).
I do support the efforts, but I am a little disappointed. If I only cared about (e.g.) line breaking, I would rather implement Unicode TR #14 on my own than struggle with this library. But if your application is all about internationalization, take a look.