The feature set of ICU4X

Written on 2023-03-25 in 5994 words ✍️.
Part of project typho digital-typesetting

Motivation

Unicode is pretty complex. In a recent talk, I was surprised that many people seemed to believe that Unicode only specifies a set of characters and its associated encodings like UTF-16 or UTF-8. No, its complexity extends to the definition of algorithms for internationalization and localization. This includes, for example, time and date formatting or string collation. ICU (International Components for Unicode) implements these algorithms. It once also provided a line layout engine, but deprecated it in favor of HarfBuzz. After icu4j (the ICU library for Java) and icu4c (the ICU library for C and C++), the newest implementation, icu4x for Rust, reached its 1.0 release last September.

Let us have a look at the features of the current 1.1.0 release.

Calendar

  • Published as a separate crate as well: icu_calendar

  • The crate has only a size of 60 kB, but with its 8 dependencies (cargo tree | wc -l shows 55 lines) compilation can take a while

  • The crate is mature. It provides an abstraction over arbitrary calendars, wraps each calendar in a module, has a concept of data providers to allow optimizations for compile-time data, wraps basic types in custom types to provide auxiliary methods, has a comprehensible list of errors, and declares relevant enums as non-exhaustive for extensibility.

  • On the other hand, I think the design of data providers is awkward and insufficiently documented. More on that later on.

  • The following calendars are supported:

    • iso

    • gregorian

    • japanese

    • julian

    • coptic

    • indian

    • buddhist

    • ethiopian

  • The Gregorian calendar differs from the Julian calendar in its leap-year rule. The Julian calendar assumes 365.25 days per year, with a leap year every fourth year. The Gregorian calendar skips the leap day in century years that are not divisible by 400 and thus averages 365.2425 days per year, which diverges less from reality. The Gregorian calendar technically includes lunar information for the computation of Easter and the like, unlike the ISO calendar. Furthermore, the “week of the year” calculation sometimes differs between the Gregorian and the ISO calendar.
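The difference between the two calendars comes down to their leap-year rules. The following is a minimal std-only sketch of these rules, not code taken from icu_calendar. The function names are my own, and years are assumed to use astronomical numbering.

```rust
/// Julian rule: every fourth year is a leap year.
fn is_leap_julian(year: i32) -> bool {
    year.rem_euclid(4) == 0
}

/// Gregorian rule: century years are leap years only if divisible by 400.
fn is_leap_gregorian(year: i32) -> bool {
    (year.rem_euclid(4) == 0 && year.rem_euclid(100) != 0) || year.rem_euclid(400) == 0
}

/// Average year length in days over one full 400-year cycle.
fn mean_year_length(is_leap: fn(i32) -> bool) -> f64 {
    let leap_years = (1..=400).filter(|&y| is_leap(y)).count() as f64;
    365.0 + leap_years / 400.0
}

fn main() {
    // 1900 is the classic year where both calendars disagree.
    assert!(is_leap_julian(1900) && !is_leap_gregorian(1900));
    assert!(is_leap_julian(2000) && is_leap_gregorian(2000));
    println!("Julian mean year:    {}", mean_year_length(is_leap_julian));
    println!("Gregorian mean year: {}", mean_year_length(is_leap_gregorian));
}
```

The Julian rule yields 100 leap years per 400 years, the Gregorian rule only 97. This is where the 365.25 versus 365.2425 days come from.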

use icu_calendar::{types::IsoWeekday, Date};

fn main() {
  let date_iso = Date::try_new_iso_date(1977, 5, 13)
    .expect("Failed to initialize ISO Date instance.");

  // use the methods {year, month, day_of_month} to access the metadata
  assert_eq!(date_iso.year().number, 1977);
  assert_eq!(date_iso.month().ordinal, 5);
  assert_eq!(date_iso.day_of_month().0, 13);

  // compute data about this timestamp
  assert_eq!(date_iso.day_of_week(), IsoWeekday::Friday);
  assert_eq!(date_iso.days_in_year(), 365);
  assert_eq!(date_iso.days_in_month(), 31);
}

You can also switch to another calendar:

// Conversion into Indian calendar: 1899-02-23.
// Requires `use icu_calendar::indian::Indian;` in addition to the imports above.
let date_indian = date_iso.to_calendar(Indian);
assert_eq!(date_indian.year().number, 1899);
assert_eq!(date_indian.month().ordinal, 2);
assert_eq!(date_indian.day_of_month().0, 23);

To represent dates as strings, we have to continue with the datetime component. Thus, we switch to the icu_datetime crate. Both are part of the icu crate. You can use cargo add icu_datetime or cargo add icu. I decided to provide the following snippets with the icu crate.

datetime

When we try to find the string representation of a datetime, we need to talk about DataProviders.

To my understanding, a data provider provides the locale-specific data needed to produce internationalized representations. For example, the locale en-u-ca-gregory carries language and calendar information. Given such a locale and a datetime value, the corresponding string representation should be retrievable. DataProviders seem to store this information. As such they try to store arbitrary, untyped information. In the best case, the data comes from CLDR (the Unicode Common Locale Data Repository). In the worst case, the data comes from icu_testdata. Don’t look into its implementation or API, but just follow the tutorial. Let us do it.

cargo install icu_datagen
icu4x-datagen --keys all --locales full --include-collations 'search*' --cldr-tag 'latest' --format blob --out internationalization_blob.postcard

Note that my blob is more complete than the one given in the tutorial (additional collations and CLDR tags). The resulting file internationalization_blob.postcard has a size of 13 MB here. Now let us try to use it:

cargo add icu --features serde
cargo add icu_provider_blob

If you know the calendar at compile-time, you can pick the TypedDateTimeFormatter<C> type where C is a calendar e.g. Iso. On the other hand, DateTimeFormatter only has to know the calendar at run-time. We will use the latter:

use std::fs;

use icu::locid::{locale, Locale};
use icu::calendar::DateTime;
use icu::datetime::{DateTimeFormatter, options::length};
use icu_provider_blob::BlobDataProvider;

// locale to use
const LOCALE: Locale = locale!("ja");

fn main() {
  // configuration
  let options = length::Bag::from_date_time_style(length::Date::Long, length::Time::Medium);
  let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
  let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
    .expect("Failed to initialize Data Provider.");

  // timestamp to use
  let timestamp = DateTime::try_new_iso_datetime(1977, 5, 13, 15, 43, 26)
    .expect("Failed to initialize ISO datetime");

  // formatter instance
  let dtf = DateTimeFormatter::try_new_with_buffer_provider(&provider, &LOCALE.into(), options.into())
    .expect("Failed to initialize DateTimeFormatter");

  println!("{}", dtf.format(&timestamp.to_any()).expect("Formatting should succeed"));
  // prints "1977年5月13日 15:43:26"
}

You can follow the tutorial to reduce the size of the blob. The locale specifies data such as language, region, and script.
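As a mental model for the data providers used above, I find a plain lookup table helpful. The following toy model is only a sketch under my own assumptions: ToyProvider, its methods, and the key name "datetime/long" are hypothetical and not part of ICU4X. The naive subtag-stripping fallback is my simplification and does not describe how ICU4X resolves locales.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a data provider: maps (key, locale) to raw bytes.
struct ToyProvider {
    data: HashMap<(String, String), Vec<u8>>,
}

impl ToyProvider {
    fn new() -> Self {
        ToyProvider { data: HashMap::new() }
    }

    fn insert(&mut self, key: &str, locale: &str, payload: &[u8]) {
        self.data
            .insert((key.to_string(), locale.to_string()), payload.to_vec());
    }

    /// Looks up the payload, dropping trailing subtags until something matches.
    fn load(&self, key: &str, locale: &str) -> Option<&[u8]> {
        let mut candidate = locale;
        loop {
            let lookup = (key.to_string(), candidate.to_string());
            if let Some(bytes) = self.data.get(&lookup) {
                return Some(bytes.as_slice());
            }
            // "en-US" -> "en"; give up once no subtag is left to drop.
            match candidate.rfind('-') {
                Some(pos) => candidate = &candidate[..pos],
                None => return None,
            }
        }
    }
}

fn main() {
    let mut provider = ToyProvider::new();
    provider.insert("datetime/long", "en", b"MMMM d, y");
    provider.insert("datetime/long", "ja", "y年M月d日".as_bytes());

    // "en-US" is not stored, so the lookup falls back to "en".
    assert_eq!(provider.load("datetime/long", "en-US"), Some(&b"MMMM d, y"[..]));
    // Nothing is stored for "de" at all.
    assert_eq!(provider.load("datetime/long", "de"), None);
}
```

The payload is untyped bytes here, which mirrors my impression that a buffer provider hands out serialized data that the formatter has to interpret itself.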

timezone

Eventually, you also have to deal with timezones. In ICU4X, a formattable time zone consists of four different fields:

  • The offset from GMT: the difference to GMT, stored in seconds

  • The time zone ID: ICU4X uses BCP-47 time zone IDs like “uschi” (unlike IANA time zone IDs, like “America/Chicago”)

  • The metazone ID: Several time zone IDs map to the same metazone ID dependent on a timestamp

  • The zone variant: either “dt” (daylight or summer time) or “st” (standard or winter time)

use std::fs;
use icu::calendar::DateTime;
use icu::datetime::time_zone::TimeZoneFormatterOptions;
use icu::timezone::CustomTimeZone;
use icu::timezone::MetazoneCalculator;
use icu::datetime::time_zone::TimeZoneFormatter;
use icu::datetime::{DateTimeFormatter, options::length};
use icu::locid::locale;
use icu_provider_blob::BlobDataProvider;
use icu::timezone::provider::{MetazoneId, TimeZoneBcp47Id};
use tinystr::tinystr;
use icu::timezone::GmtOffset;

fn main() {
  let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
  let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
    .expect("Failed to initialize Data Provider.");
  let mzc = MetazoneCalculator::try_new_with_buffer_provider(&provider).unwrap();

  let tzf = TimeZoneFormatter::try_new_with_buffer_provider(
    &provider, &locale!("en").into(), TimeZoneFormatterOptions::default()
  ).unwrap();

  // timezone "gugum" corresponds to metazone "guam"
  let ref_date = DateTime::try_new_iso_datetime(1977, 5, 13, 15, 43, 26)
    .expect("Failed to initialize ISO Date instance.");

  let timezone_id = TimeZoneBcp47Id(tinystr!(8, "gugum"));
  let metazone = mzc.compute_metazone_from_time_zone(timezone_id, &ref_date);
  let expected_metazone_id = MetazoneId(tinystr!(4, "guam"));

  assert_eq!(metazone, Some(expected_metazone_id));

  // parsing and default formatting
  let timezone = "+0530".parse::<CustomTimeZone>().unwrap();
  assert_eq!(tzf.format_to_string(&timezone), "GMT+05:30");

  // more sophisticated parsing
  let tz0: CustomTimeZone = "Z".parse().expect("Failed to parse a time zone.");
  let tz1: CustomTimeZone = "+02".parse().expect("Failed to parse a time zone.");
  let tz2: CustomTimeZone = "-0230".parse().expect("Failed to parse a time zone.");
  let tz3: CustomTimeZone = "+02:30".parse().expect("Failed to parse a time zone.");

  assert_eq!(tz0.gmt_offset.map(GmtOffset::offset_seconds), Some(0));
  assert_eq!(tz1.gmt_offset.map(GmtOffset::offset_seconds), Some(7200));
  assert_eq!(tz2.gmt_offset.map(GmtOffset::offset_seconds), Some(-9000));
  assert_eq!(tz3.gmt_offset.map(GmtOffset::offset_seconds), Some(9000));
}

At this point, I dislike the fact that one has to use tinystr and cannot simply use &str.
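The offset values asserted above are plain arithmetic on hours and minutes. The following std-only sketch reproduces them for the four notations used in the example. It is not how CustomTimeZone parses its input; parse_offset_seconds is a hypothetical helper, and it performs no range validation of hours or minutes.

```rust
/// Parses "Z", "+hh", "+hhmm", or "+hh:mm" (and the "-" variants) into seconds.
fn parse_offset_seconds(input: &str) -> Option<i32> {
    if input == "Z" {
        return Some(0);
    }
    let sign = match input.chars().next()? {
        '+' => 1,
        '-' => -1,
        _ => return None,
    };
    // Skip the sign (a single ASCII byte) and drop any colons.
    let digits: String = input[1..].chars().filter(|&c| c != ':').collect();
    if !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    let (hours, minutes) = match digits.len() {
        2 => (digits.parse::<i32>().ok()?, 0),
        4 => (
            digits[..2].parse::<i32>().ok()?,
            digits[2..].parse::<i32>().ok()?,
        ),
        _ => return None,
    };
    Some(sign * (hours * 3600 + minutes * 60))
}

fn main() {
    for text in ["Z", "+02", "-0230", "+02:30", "+0530"] {
        println!("{:>6} -> {:?}", text, parse_offset_seconds(text));
    }
}
```

For example, "-0230" means 2 hours and 30 minutes west of GMT, so 2 · 3600 + 30 · 60 = 9000 seconds with a negative sign.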

decimal

Here, I discovered a difference between a DataLocale and a Locale. “DataLocale contains less functionality than Locale but more than LanguageIdentifier for better size and performance while still meeting the needs of the ICU4X data pipeline” says the documentation. Furthermore, I had to cargo add icu_provider, because icu_provider does not seem to be importable through icu::provider.

use std::fs;
use icu::locid::{locale, Locale};
use icu_provider::DataLocale;
use icu_provider_blob::BlobDataProvider;
use fixed_decimal::FixedDecimal;
use icu::decimal::FixedDecimalFormatter;

// locale to use
const LOCALE: Locale = locale!("en");

fn main() {
  let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
  let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
    .expect("Failed to initialize Data Provider.");

  let fdf = FixedDecimalFormatter::try_new_with_buffer_provider(
    &provider, &DataLocale::from(LOCALE), Default::default()
  ).expect("Data should load successfully");

  let fixed_decimal = FixedDecimal::from(1000007);

  println!("{}", fdf.format_to_string(&fixed_decimal));
  // prints "১০,০০,০০৭" for locale "bn"
  // prints "1 000 007" for locale "sv"
  // prints "1.000.007" for locale "de"
  // prints "1,000,007" for locale "en"
}

Support for currencies, measurement units, and compact notation is planned. To track progress, follow issue #275.
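To see what the formatter contributes for the simple cases above, here is a std-only sketch that only inserts a separator between groups of three digits. group_digits is a hypothetical helper of mine. It deliberately ignores everything locale data actually provides: other grouping sizes and native digits (as in the "bn" output), minimum grouping digits, and signs or fractions.

```rust
/// Inserts `separator` between groups of three digits, counted from the right.
fn group_digits(value: u64, separator: &str) -> String {
    let digits = value.to_string();
    let mut out = String::new();
    for (i, ch) in digits.chars().enumerate() {
        let remaining = digits.len() - i;
        // A separator belongs in front of every complete group of three,
        // except at the very beginning of the number.
        if i > 0 && remaining % 3 == 0 {
            out.push_str(separator);
        }
        out.push(ch);
    }
    out
}

fn main() {
    assert_eq!(group_digits(1000007, ","), "1,000,007"); // like locale "en"
    assert_eq!(group_digits(1000007, "."), "1.000.007"); // like locale "de"
    assert_eq!(group_digits(999, ","), "999");
}
```

The point of FixedDecimalFormatter is that the separator, the grouping sizes, and the digits come from the locale data instead of being hardcoded like this.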

case folding

The corresponding crate for case folding is declared experimental. Its API might change anytime. And I could not make it run. I could not find a DataProvider providing CaseMappingV1Marker. Unlike other structs, CaseMapping does not provide an appropriate constructor for my BlobDataProvider. Note that I don’t want to use icu_testdata, which might satisfy it. But since no example is provided in the API reference, I don’t know any working solution.

plural

use std::fs;
use icu::locid::{locale, Locale};
use icu_provider_blob::BlobDataProvider;
use icu::plurals::{PluralCategory, PluralRules};

// locale to use
const LOCALE: Locale = locale!("en");

fn main() {
  let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
  let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice()).unwrap();

  let pr = PluralRules::try_new_ordinal_with_buffer_provider(&provider, &LOCALE.into()).unwrap();

  for i in 0..6 {
    let fallback = format!("{}th", i);
    println!("{} apple",
      match pr.category_for(i) {
        PluralCategory::Zero => panic!("impossible for locale 'en'"),
        PluralCategory::One => "1st", // {1}
        PluralCategory::Two => "2nd", // {2}
        PluralCategory::Few => "3rd", // {3}
        PluralCategory::Many => panic!("impossible for locale 'en'"),
        PluralCategory::Other => &fallback, // {0, 4, 5}
      }
    );
  }
}

Creating the plural of a word is linguistically difficult. We cannot create the plural of a word with this API. So its usefulness is limited. But it is a step in the right direction. It gives us two layers which provide us necessary distinctions.

First, one needs to understand the layer of PluralRuleType:

Cardinal

3 doors, 1 month, 10 dollars

Ordinal

1st place, 10th day, 11th floor

This is somewhat intuitive. Speaking of English, is it an ordinal? If so, one needs to distinguish between the suffixes “st”, “nd”, “rd”, and “th” for integers. Is it a cardinal? If so, the inflection influences the word itself. One needs to distinguish no suffix for quantity 1 and the suffix “s” for other quantities. Compare the words “table” and “tables”.

Second, we need to understand the layer of PluralCategory. The documentation has some interesting notes with examples for cardinals:

Zero

Arabic (ar) and Latvian (lv) have an inflection for zero quantities. Latvian also uses it for multiples of 10.

One

The singular occurs in every language, but for example Filipino (fil) uses it for {2, 3, 5, 7, 8, …} as well

Two

A form used for 2 in Arabic (ar), Hebrew (iw), and Slovenian (sl)

Few

A form used for 0 in Romanian (ro) as well as 1.2 in Croatian (hr), Romanian (ro), Slovenian (sl), and Serbian (sr) as well as 5 in Arabic (ar), Lithuanian (lt), Romanian (ro)

Many

A form used for 1.0 in Czech (cs) and Slovak (sk) as well as for 1.1 in Czech (cs) and Slovak (sk) as well as for 15 in Russian (ru) and Ukrainian (uk)

Other

A catch-all form. The only used variant for Japanese, Chinese, Korean, and Thai since they don’t use plural forms.

A third layer, where we inflect the word within a plural category is missing. It is also linguistically difficult (except for Esperanto 😅). But as a result, we can implement the following process for locale “en” and the word “table” in this API:

  • Is it a cardinal?

    • Is it in plural category One? Then add no suffix to the noun.

    • Is it in plural category Other? Then add suffix “s” to the noun.

  • Is it an ordinal?

    • Is it in plural category One? Then add suffix “st” to the integer.

    • Is it in plural category Two? Then add suffix “nd” to the integer.

    • Is it in plural category Few? Then add suffix “rd” to the integer.

    • Is it in plural category Other? Then add suffix “th” to the integer.
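The process above can be sketched without ICU4X by hardcoding the English rules. This is an illustration under my own assumptions: the Category enum and all function names are hypothetical, the category rules are written out by hand for locale “en” only, and appending “s” obviously fails for irregular nouns.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Category {
    One,
    Two,
    Few,
    Other,
}

/// English ordinal categories: 1st, 2nd, 3rd, but 11th, 12th, 13th.
fn english_ordinal_category(n: u64) -> Category {
    match (n % 10, n % 100) {
        (1, r) if r != 11 => Category::One,
        (2, r) if r != 12 => Category::Two,
        (3, r) if r != 13 => Category::Few,
        _ => Category::Other,
    }
}

/// English cardinal categories for integers: only 1 is singular.
fn english_cardinal_category(n: u64) -> Category {
    if n == 1 {
        Category::One
    } else {
        Category::Other
    }
}

/// Ordinal: the suffix is attached to the integer.
fn ordinal(n: u64) -> String {
    let suffix = match english_ordinal_category(n) {
        Category::One => "st",
        Category::Two => "nd",
        Category::Few => "rd",
        Category::Other => "th",
    };
    format!("{}{}", n, suffix)
}

/// Cardinal: the suffix is attached to the noun.
fn cardinal(n: u64, noun: &str) -> String {
    match english_cardinal_category(n) {
        Category::One => format!("{} {}", n, noun),
        _ => format!("{} {}s", n, noun),
    }
}

fn main() {
    assert_eq!(ordinal(21), "21st");
    assert_eq!(ordinal(11), "11th");
    assert_eq!(cardinal(1, "table"), "1 table");
    assert_eq!(cardinal(3, "table"), "3 tables");
}
```

With ICU4X, the two category functions would be replaced by PluralRules instances for the ordinal and the cardinal rule type, while the mapping from category to suffix remains the application's job.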

collation

use std::fs;
use core::cmp::Ordering;
use icu_provider_blob::BlobDataProvider;
use icu::collator::*;
use icu::locid::{locale, Locale};

// locale to use
const LOCALE: Locale = locale!("en");

fn main() {
  let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
  let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
    .expect("Failed to initialize Data Provider.");

  let mut words = ["pollo", "polvo"];

  let locale_es: Locale = locale!("es-u-co-trad");
  let mut options = CollatorOptions::new();
  options.strength = Some(Strength::Primary);
  let collator_es: Collator = Collator::try_new_with_buffer_provider(
    &provider, &locale_es.into(), options
  ).unwrap();

  // NOTE: "pollo" > "polvo" in traditional Spanish
  words.sort();
  println!("words = {:?}", &words); // ["pollo", "polvo"] in pure rust without locale support
  words.sort_by(|a, b| collator_es.compare(a, b));
  println!("words = {:?}", &words); // ["polvo", "pollo"] in ICU4X with locale support
}
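Why does “pollo” sort after “polvo” in traditional Spanish? Because the digraph “ll” counts as a letter of its own that follows “l” (and likewise “ch” follows “c”). The following std-only sketch mimics just that one aspect with a handmade sort key. It is not the Unicode Collation Algorithm, traditional_spanish_key is a hypothetical helper, and accented letters and “ñ” are not placed correctly.

```rust
/// Builds a sort key in which the digraphs "ch" and "ll" count as single
/// letters that sort directly after "c" and "l".
fn traditional_spanish_key(word: &str) -> Vec<u32> {
    let mut key = Vec::new();
    let mut chars = word.chars().flat_map(char::to_lowercase).peekable();
    while let Some(c) = chars.next() {
        let next = chars.peek().copied();
        let is_digraph = matches!((c, next), ('c', Some('h')) | ('l', Some('l')));
        if is_digraph {
            chars.next(); // consume the second half of the digraph
            key.push(c as u32 * 2 + 1); // slot right after the plain letter
        } else {
            key.push(c as u32 * 2);
        }
    }
    key
}

fn main() {
    let mut words = ["pollo", "polvo"];

    words.sort(); // code point order
    assert_eq!(words, ["pollo", "polvo"]);

    words.sort_by_key(|w| traditional_spanish_key(w)); // digraph-aware order
    assert_eq!(words, ["polvo", "pollo"]);
}
```

Doubling every code point leaves a free slot after each letter, which the digraphs occupy. A real collator works with weights from the collation data instead.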

list

use std::fs;
use icu_provider_blob::BlobDataProvider;
use icu::locid::locale;
use icu::list::ListFormatter;
use icu::list::ListLength;

fn main() {
  let blob = fs::read("internationalization_blob.postcard").expect("Failed to read file");
  let provider = BlobDataProvider::try_new_from_blob(blob.into_boxed_slice())
    .expect("Failed to initialize Data Provider.");

  let list_formatter = ListFormatter::try_new_and_with_length_with_buffer_provider(
    &provider,
    &locale!("es").into(),
    ListLength::Wide,
  )
  .expect("Data should load successfully");

  println!("{:?}", list_formatter.format_to_string(["España", "Suiza"].iter()));
  // prints "España y Suiza"

  println!("{:?}", list_formatter.format_to_string(["España", "Suiza", "Italia"].iter()));
  // The Spanish 'y' sometimes becomes an 'e':
  // prints "España, Suiza e Italia"
}

In this case, the list depends on the locale and the configuration ListLength which is one of Narrow, Short, and Wide. These are explained in TR35. For example, “Jan., Feb., Mar.” is narrow unlike “Jan., Feb., and Mar.” which is short, because the conjunction is expressed explicitly.
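To illustrate what the locale data encodes for the Spanish example, here is a std-only sketch of an “and” list with the switch from “y” to “e”. format_spanish_and_list is a hypothetical helper. The condition is a simplification of mine: it switches before any item starting with “i” or “hi”, while the real rule has exceptions (for example words starting with “hie”).

```rust
/// Joins items as "A, B y C". The conjunction becomes "e" before an item
/// starting with "i" or "hi" (a simplification of the actual rule).
fn format_spanish_and_list(items: &[&str]) -> String {
    match items {
        [] => String::new(),
        [single] => single.to_string(),
        [head @ .., last] => {
            let lower = last.to_lowercase();
            let conjunction = if lower.starts_with('i') || lower.starts_with("hi") {
                "e"
            } else {
                "y"
            };
            format!("{} {} {}", head.join(", "), conjunction, last)
        }
    }
}

fn main() {
    println!("{}", format_spanish_and_list(&["España", "Suiza"]));
    println!("{}", format_spanish_and_list(&["España", "Suiza", "Italia"]));
}
```

ListFormatter gets patterns and such special cases from the data provider, so none of this has to be hardcoded per language.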

properties and categories

Unfortunately, this crate does not work with BlobDataProvider either. You are forced to use icu_testdata. So I give up and actually use it.

The interesting idea about this API is that you can access sets of data. Specifically, the Unicode-defined sets can be read from the available load functions. These sets (type CodePointSetData) exist for binary properties and certain enumerated properties as the API explains.

On the other hand, APIs that return a CodePointMapData exist for certain enumerated properties. Specifically, the default example shows that you can check the Script property for Unicode scalars:

use icu::properties::{maps, Script};

let map = maps::load_script(&icu_testdata::unstable())
  .expect("The data should be valid");
let script = map.as_borrowed();

assert_eq!(script.get('🎃'), Script::Common); // U+1F383 JACK-O-LANTERN
assert_eq!(script.get('木'), Script::Han); // U+6728

Unicode normalization

use icu_testdata;
use icu::normalizer;

fn main() -> Result<(), normalizer::NormalizerError> {
  {
    let normalizer = normalizer::ComposingNormalizer::try_new_nfc_with_any_provider(&icu_testdata::any())?;
    // X := U+0043  LATIN CAPITAL LETTER C
    // Y := U+0327  COMBINING CEDILLA
    // Z := U+00C7  LATIN CAPITAL LETTER C WITH CEDILLA
    // {X, Y} is canonically equivalent to {Z}
    let input_text = "\u{0043}\u{0327}";
    let expected_text = "\u{00C7}";
    let normalized_text = normalizer.normalize(input_text);
    assert_eq!(normalized_text, expected_text);
  }

  {
    let normalizer = normalizer::ComposingNormalizer::try_new_nfkc_with_any_provider(&icu_testdata::any())?;
    // R := U+2460  CIRCLED DIGIT ONE
    // S := U+0031  DIGIT ONE
    // {R} is compatibility-equivalent to {S}
    let input_text = "\u{2460}";
    let expected_text = "\u{0031}";
    let normalized_text = normalizer.normalize(input_text);
    assert_eq!(normalized_text, expected_text);
  }

  Ok(())
}
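Why bother with normalization at all? Because plain string comparison in Rust works on code points (bytes, actually) and knows nothing about canonical equivalence. The following std-only sketch shows the problem that the normalizer solves. code_points is a hypothetical helper.

```rust
/// Returns the code points of a string as numbers.
fn code_points(text: &str) -> Vec<u32> {
    text.chars().map(|c| c as u32).collect()
}

fn main() {
    let decomposed = "\u{0043}\u{0327}"; // C followed by COMBINING CEDILLA
    let precomposed = "\u{00C7}"; // C WITH CEDILLA as a single code point

    // Both are canonically equivalent, yet std considers them different.
    assert_ne!(decomposed, precomposed);
    assert_eq!(code_points(decomposed), vec![0x43, 0x327]);
    assert_eq!(code_points(precomposed), vec![0xC7]);

    // Even the UTF-8 lengths differ: 1 + 2 bytes versus 2 bytes.
    assert_eq!(decomposed.len(), 3);
    assert_eq!(precomposed.len(), 2);
}
```

Comparing, hashing, or searching user-provided text reliably therefore requires bringing both sides into the same normalization form first.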

Unicode segmentation

The entire API of segmentation is experimental. So you need to run cargo add icu_testdata --features icu_segmenter and cargo add icu --features icu_segmenter to enable this module.

Once more, we have a dependency on icu_testdata. Since I am currently reading Unicode TR #14, my guess is that the character classes of Unicode scalars are loaded as data (“the algorithm defined in Section 6, Line Breaking Algorithm also makes use of East_Asian_Width property values, defined in Unicode Standard Annex #11, East Asian Width [UAX11]”) and not hardcoded.

use icu::segmenter::LineSegmenter;
use icu::segmenter::WordSegmenter;
use icu::segmenter::SentenceSegmenter;
use icu::segmenter::GraphemeClusterSegmenter;
use icu_testdata;

fn main() {
  let line_seg = LineSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();

  let breakpoints: Vec<usize> = line_seg.segment_str("Hello World").collect();
  assert_eq!(&breakpoints, &[6, 11]);

  let word_seg = WordSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
  let breakpoints: Vec<usize> = word_seg.segment_latin1(b"Hello World").collect();
  assert_eq!(&breakpoints, &[0, 5, 6, 11]);

  let sentence_seg = SentenceSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
  let breakpoints: Vec<usize> = sentence_seg.segment_latin1(b"Hello World").collect();
  assert_eq!(&breakpoints, &[0, 11]);

  let graph_seg = GraphemeClusterSegmenter::try_new_unstable(&icu_testdata::unstable()).unwrap();
  let breakpoints: Vec<usize> = graph_seg.segment_str("देवनागरी 🗺").collect();
  assert_eq!(&breakpoints, &[0, 6, 9, 15, 18, 24, 25, 29]);
}
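The breakpoints returned by segment_str are byte offsets into the UTF-8 string, which explains the jumps by 3 and 6 in the Devanagari example. The following std-only sketch lists the code point boundaries of the same text and checks that the grapheme breakpoints asserted above are a subset of them. code_point_boundaries is a hypothetical helper, and the text is written with escapes to make the code points explicit.

```rust
/// Byte offsets at which a code point starts, plus the end of the string.
fn code_point_boundaries(text: &str) -> Vec<usize> {
    text.char_indices()
        .map(|(offset, _)| offset)
        .chain(std::iter::once(text.len()))
        .collect()
}

fn main() {
    // The Devanagari word from the example, a space, and U+1F5FA WORLD MAP.
    let text = "\u{0926}\u{0947}\u{0935}\u{0928}\u{093E}\u{0917}\u{0930}\u{0940} \u{1F5FA}";

    // Each Devanagari code point takes 3 bytes, the space 1, the emoji 4.
    let boundaries = code_point_boundaries(text);
    assert_eq!(boundaries, vec![0, 3, 6, 9, 12, 15, 18, 21, 24, 25, 29]);

    // Grapheme breakpoints as asserted in the ICU4X example above.
    let graphemes = [0, 6, 9, 15, 18, 24, 25, 29];
    assert!(graphemes.iter().all(|b| boundaries.contains(b)));

    // The first grapheme cluster consists of two code points (6 bytes).
    assert_eq!(text[0..6].chars().count(), 2);
}
```

A consonant and its dependent vowel sign are two code points but one grapheme cluster. std can only tell you about the former, and that is the gap the segmenter fills.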

Compatibility

I wonder in which ways I can build the library. In general, icu4x implements all the important functionality in Rust, but there seems to be some FFI code in C++ used for fallback lookups. I tried to compile the last example for all “Tier 1 with Host Tools” targets:

Target and compilation result:

  • aarch64-unknown-linux-gnu: linking error: error adding symbols: file in wrong format

  • i686-pc-windows-gnu: success

  • i686-pc-windows-msvc: success until the linking step, linker link.exe not found (expected result on my x86_64-unknown-linux-gnu system)

  • i686-unknown-linux-gnu: success until the linking step, cannot find -lgcc: No such file or directory (I would have to install the appropriate gcc dependency)

  • x86_64-apple-darwin: success until the linking step, unrecognized command-line option '-arch' (honestly, I don’t know, -arch is listed as cc argument in the manual)

  • x86_64-pc-windows-gnu: success

  • x86_64-pc-windows-msvc: success until the linking step, linker link.exe not found (expected result on my x86_64-unknown-linux-gnu system)

  • x86_64-unknown-linux-gnu: success

  • wasm32-unknown-unknown: success

  • wasm32-wasi: compilation failure: 'assertion failed: pos.get() <= self.position()', compiler/rustc_metadata/src/rmeta/encoder.rs:426:9 (for icu_testdata)

Okay, I admit. The final two lines are from Tier 2, but I am personally interested in WebAssembly support.

Conclusion

  • There are several confusing abstractions where one gives up and just copies code from working examples. This includes DataProvider and DataLocale versus Locale. Certainly there was some initial motivation, but it is insufficiently documented and does not feel consistent with other concepts. DataProvider is super-abstract and seems to circumvent Rust’s type system (with Any types), contrary to Locale. DataLocale seems to be a more specialized version of Locale. This seems to follow usual Rust conventions with a from constructor.

  • Locale implements Into to allow conversion of a Locale instance into a DataLocale. I think this is neat, but as a programmer you really lose track of which kind of information is part of the Locale. I think this restricts the applicability of the library a lot (“how do I convey to users what information they need to provide in the locale string?”).

  • Some modules can be imported as icu_calendar as well as icu::calendar since they are provided as separate crates too. But not others (icu_provider).

  • If you actually want to provide your own localization data, implementing the traits does not seem too bad. icu_testdata is very readable for Rust programmers.

In the end, I would conclude that you need an experienced Rust programmer and a motivated programmer to use this library. The former, because someone needs to properly interpret errors with trait dependencies, and the latter, because someone always needs to follow the up-to-date news of the library. Most of the features are hidden behind unstable dependencies (feature flags, unstable keyword in constructors, experimental APIs, binary blobs which might not provide the latest data).

I do support the efforts, but I am a little disappointed. If I only care about (e.g.) line breaking, I would rather implement Unicode TR #14 on my own than struggle with this library. If your application is all about internationalization, take a look.