opstr release 1.0

Written on 2024-04-13 in 1758 words ✍️.
Part of project typho

Motivation

While at university, I implemented a lot of post-quantum cryptography. My use cases on my computer always boiled down to similar questions: Is this integer a prime number? Which result does the extended Euclidean algorithm give for these integers? To answer them, I started writing a small Python tool, “opint”, which I later ported to Rust. But I never finished the design, because I stopped working in cryptography.

Today, I work with strings (in the Unicode sense), and I have similar use cases. A shell is probably the most quickly accessible interface on my computer, but bash (and the same goes for most other Unix shells) has shitty conventions:

$ TEXT="My name is Bond … James Bond"
$ echo ${TEXT//Bond/Gosling}
My name is Gosling … James Gosling

Ok, that’s nice, but did you remember that you have to use slashes for these substitutions? And did you remember in which positions you had to use how many slashes? And do you remember how the escaping rules work if you have to use a slash in your search string? I think this can be done with a simpler command-line interface.

Let me introduce “opstr”.

The project

One week ago, I gave a lightning talk at Grazer Linuxtage presenting the new 1.0 release. The feedback from the GLT24 audience led to the development of the current 1.1 release. Releases follow a semver-based versioning scheme. I implemented opstr as part of my larger project typho.

Usage

(Remark: opstr has colorized output; in this blog post it is not colorized.)

$ opstr --version
opstr 1.1.0

If you download a release and build it, the executable shows the version with --version.

Now, the idea is that you throw some strings at the executable and it applies operations (abbreviated “ops”) to those strings:

$ opstr string
----- format -----------------------------------------------------------
string
----- utf8-bytes -------------------------------------------------------
[ 115
| 116
| 114
| 105
| 110
| 103
]


[…]

----- join -------------------------------------------------------------

----- count-grapheme-clusters ------------------------------------------
6
----- is-lf-lineterminated ---------------------------------------------
true
----- is-crlf-lineterminated -------------------------------------------
true

As you can see, count-grapheme-clusters is one of the ops. It counts the number of grapheme clusters as defined by Unicode.
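To illustrate why counting grapheme clusters is an op of its own, here is a minimal Rust sketch (not opstr's code) comparing the two cheaper counts that the standard library offers. The helper names utf8_len and codepoint_count are made up for this post. Rust's standard library has no grapheme segmentation, so the grapheme count is only stated in a comment.

```rust
/// Number of bytes in the UTF-8 encoding of `text` (hypothetical helper).
fn utf8_len(text: &str) -> usize {
    text.len()
}

/// Number of Unicode scalar values in `text` (hypothetical helper).
fn codepoint_count(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    // For plain ASCII, all counts agree.
    let plain = "string";
    println!("{} bytes, {} codepoints", utf8_len(plain), codepoint_count(plain));

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: a reader perceives one
    // character (one grapheme cluster), but there are 2 codepoints and 3 bytes.
    let combined = "e\u{301}";
    println!("{} bytes, {} codepoints", utf8_len(combined), codepoint_count(combined));
}
```

So neither the byte count nor the codepoint count answers “how many characters does a reader see?”, which is the question count-grapheme-clusters addresses.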

The ops are listed from “most useful” to “least useful”. “format” (string formatting with placeholders like {:<20}) is considered most useful, but I admit that this ranking is not perfect. The list of utf8-bytes might be more interesting for this string. But this is subjective anyway, and what do these ops actually do? First, you can get the full list of ops with a description:

$ opstr --list-ops
{ base64-decode::         base64 decoding of provided hexadecimal string #1
| base64-encode::         base64 encoding of provided string #1
| base64-url-safe-decode::        base64 decoding of provided string #1 with URL-appropriate representation (c.f. RFC 3548)
| base64-url-safe-encode::        base64 encoding of provided string #1 with URL-appropriate representation (c.f. RFC 3548)
| camelcase::     turn #1 to lowercase and replace the ASCII character after ' ' or '_' sequences with an uppercase letter
| center::        put string #1 in the middle of string of width #2 (default 80) repeating char #3 (default #) on both sides

[…]

| uppercase-for-ascii::   get locale-independent/ASCII uppercase version of string #1
| utf16-big-endian-bytes::        encode string #1 in UTF-16 and return its bytes in big endian order
| utf16-little-endian-bytes::     encode string #1 in UTF-16 and return its bytes in little endian order
| utf8-bytes::    encode string #1 in UTF-8 and return its bytes
| word-clusters::         return “Word clusters” of string #1 according to Unicode Standard Annex 29 “Unicode Text Segmentation”
| xml-decode::    replace the 5 pre-defined XML entities with their unescaped characters &<>"' in string #1
| xml-encode::    replace the 5 characters &<>"' with their pre-defined XML entities in string #1
}

(I keep cutting the output short because the list is quite long.)

Once you have looked through the list, you can run one specific op:

$ opstr --op center "main classes"
################################# main classes #################################

$ opstr --op center "main classes" 32
######### main classes #########

$ opstr --op center "main classes" 32 _
_________ main classes _________

You can see that op “center” puts the string in the middle of a line of hash symbols with a total width of 80 characters. Optional arguments allow you to adjust the width (second argument) and the fill character (third argument).
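The behavior shown above can be approximated in a few lines of Rust. This is a sketch under my own assumptions, not opstr's actual implementation: the function name center is hypothetical, widths are counted in codepoints, and when the padding cannot be split evenly, the extra fill character goes to the right.

```rust
/// Hypothetical re-implementation of a "center"-like op.
/// Surrounds `text` with one space on each side and fills the remaining
/// width with `fill`. Width is measured in `char`s (a simplification).
fn center(text: &str, width: usize, fill: char) -> String {
    let inner = format!(" {} ", text);
    let len = inner.chars().count();
    if len >= width {
        // Assumption: text that does not fit is returned without fill characters.
        return inner;
    }
    let total = width - len;
    let left = total / 2;
    let right = total - left; // assumption: an odd remainder goes to the right
    let mut out = String::new();
    out.extend(std::iter::repeat(fill).take(left));
    out.push_str(&inner);
    out.extend(std::iter::repeat(fill).take(right));
    out
}

fn main() {
    println!("{}", center("main classes", 80, '#'));
    println!("{}", center("main classes", 32, '#'));
    println!("{}", center("main classes", 32, '_'));
}
```

For the three calls in main, this reproduces the lines shown in the transcript above.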

$ opstr --op codepoints "main classes"
[ 109
| 97
| 105
| 110
| 32
| 99
| 108
| 97
| 115
| 115
| 101
| 115
]

Op “codepoints” returns the list of Unicode codepoints. The output is meant to be very human-readable, but if you need the output for a specific programming language, it can be dumped as such:

$ opstr --syntax go --op codepoints "main classes"
[]int64{109, 97, 105, 110, 32, 99, 108, 97, 115, 115, 101, 115}

$ opstr --syntax kotlin --op codepoints "main classes"
arrayOf(109uL, 97uL, 105uL, 110uL, 32uL, 99uL, 108uL, 97uL, 115uL, 115uL, 101uL, 115uL)

$ opstr --syntax perl --op codepoints "main classes"
(109, 97, 105, 110, 32, 99, 108, 97, 115, 115, 101, 115)

$ opstr --syntax c --op codepoints "main classes"
uint64_t list[12] = {109, 97, 105, 110, 32, 99, 108, 97, 115, 115, 101, 115};

$ opstr --syntax rust --op codepoints "main classes"
let mut array: [int64; 12] = [109, 97, 105, 110, 32, 99, 108, 97, 115, 115, 101, 115];

(In subsequent releases, I might handle questions like “should a variable name be printed?” more consistently across languages.)
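How one list of codepoints can end up in different syntaxes can be sketched as follows. The Syntax enum and the functions codepoints and render are names I made up for this illustration; they are not opstr's internals, and only two of the syntaxes shown above are covered.

```rust
/// Output syntaxes covered by this sketch (hypothetical type).
enum Syntax {
    Go,
    Perl,
}

/// Unicode codepoints (more precisely: Unicode scalar values) of `text`.
fn codepoints(text: &str) -> Vec<u32> {
    text.chars().map(|c| c as u32).collect()
}

/// Render a list of integers as a literal in the requested language.
fn render(values: &[u32], syntax: &Syntax) -> String {
    let items: Vec<String> = values.iter().map(|v| v.to_string()).collect();
    let joined = items.join(", ");
    match syntax {
        // `{{` and `}}` are escaped braces in Rust format strings.
        Syntax::Go => format!("[]int64{{{}}}", joined),
        Syntax::Perl => format!("({})", joined),
    }
}

fn main() {
    let cps = codepoints("main classes");
    println!("{}", render(&cps, &Syntax::Go));
    println!("{}", render(&cps, &Syntax::Perl));
}
```

The point of the design is that an op only has to produce a value (here, a list of integers) and the representation is chosen in one separate place.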

Thus, several languages are supported. You might not want to provide the syntax specifier as a command-line argument every time, so I also provide configuration through environment variables:

$ export OPSTR_SYNTAX=go
$ opstr --op codepoints "main classes"
[]int64{109, 97, 105, 110, 32, 99, 108, 97, 115, 115, 101, 115}
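A common way to combine a command-line argument with an environment variable is a simple fallback chain. The sketch below is an assumption about how such a lookup could work, not a description of opstr: the function resolve_syntax, the precedence of the command-line value over the environment, and the default name "human" are all invented for illustration.

```rust
use std::env;

/// Pick the output syntax. Assumed precedence: an explicit command-line
/// value wins over the environment value. "human" is a made-up name for
/// the default representation in this sketch.
fn resolve_syntax(cli: Option<&str>, env_value: Option<&str>) -> String {
    cli.or(env_value).unwrap_or("human").to_string()
}

fn main() {
    // OPSTR_SYNTAX is the variable name used in this post.
    let from_env = env::var("OPSTR_SYNTAX").ok();
    // In a real program the first argument would come from the CLI parser.
    let syntax = resolve_syntax(None, from_env.as_deref());
    println!("{}", syntax);
}
```

Keeping the lookup in a pure function that receives both values makes it easy to test without touching the process environment.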

Furthermore, I want to take a look at outputs with tables:

$ opstr --op codepoint-frequencies "free-open source software"
header := []string{"frequency", "percentage", "codepoint", "codepoint-name"}
[][]any{[]any{5, 20, "e", "LATIN SMALL LETTER E"}
, []any{3, 12, "o", "LATIN SMALL LETTER O"}
, []any{3, 12, "r", "LATIN SMALL LETTER R"}
, []any{2, 8, " ", "SPACE"}
, []any{2, 8, "f", "LATIN SMALL LETTER F"}
, []any{2, 8, "s", "LATIN SMALL LETTER S"}
, []any{1, 4, "-", "HYPHEN-MINUS"}
, []any{1, 4, "a", "LATIN SMALL LETTER A"}
, []any{1, 4, "c", "LATIN SMALL LETTER C"}
, []any{1, 4, "n", "LATIN SMALL LETTER N"}
, []any{1, 4, "p", "LATIN SMALL LETTER P"}
, []any{1, 4, "t", "LATIN SMALL LETTER T"}
, []any{1, 4, "u", "LATIN SMALL LETTER U"}
, []any{1, 4, "w", "LATIN SMALL LETTER W"}
, }

Op “codepoint-frequencies” shows that “e” occurs most often, while “w” is listed last among the codepoints that occur only once. From this table, we can extract one specific entry through the --item and --column arguments:

$ opstr --op codepoint-frequencies --item 1 "free-open source software"
header := []string{"frequency", "percentage", "codepoint", "codepoint-name"}
[][]any{[]any{3, 12, "o", "LATIN SMALL LETTER O"}
, }

$ opstr --op codepoint-frequencies --item 0 "free-open source software"
header := []string{"frequency", "percentage", "codepoint", "codepoint-name"}
[][]any{[]any{5, 20, "e", "LATIN SMALL LETTER E"}
, }

$ opstr --op codepoint-frequencies --item 0 --column codepoint "free-open source software"
"e"

$ opstr --op codepoint-frequencies --item 1 --column codepoint "free-open source software"
"o"

$ opstr --op codepoint-frequencies --item -1 --column codepoint "free-open source software"
"w"

So --item uses zero-based indices for the rows, and negative values count from the end of the table. That is powerful: you can easily extract information from the result of a string op!
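The table and the index rule can be sketched in Rust as follows. This is my own approximation for illustration: the Row type and the functions codepoint_frequencies and item are hypothetical names, the percentage uses integer division, and ties are ordered by ascending codepoint because that matches the output above. Codepoint names are left out, since the standard library does not provide them.

```rust
use std::collections::BTreeMap;

/// One row of the table: (frequency, percentage, codepoint).
type Row = (usize, usize, char);

/// Count codepoints, then order rows by descending frequency.
fn codepoint_frequencies(text: &str) -> Vec<Row> {
    let total = text.chars().count();
    let mut counts: BTreeMap<char, usize> = BTreeMap::new();
    for c in text.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    let mut rows: Vec<Row> = counts
        .into_iter()
        .map(|(c, n)| (n, n * 100 / total, c)) // assumption: truncating percentage
        .collect();
    // The BTreeMap yields rows in codepoint order; the stable sort keeps
    // that order among rows with equal frequency.
    rows.sort_by(|a, b| b.0.cmp(&a.0));
    rows
}

/// Resolve an `--item`-style index: 0 is the first row, -1 the last one.
fn item(rows: &[Row], index: isize) -> Option<&Row> {
    let len = rows.len() as isize;
    let resolved = if index < 0 { len + index } else { index };
    if resolved < 0 || resolved >= len {
        None
    } else {
        rows.get(resolved as usize)
    }
}

fn main() {
    let rows = codepoint_frequencies("free-open source software");
    for index in [0, 1, -1] {
        if let Some((freq, percent, c)) = item(&rows, index) {
            println!("item {}: {:?} occurs {} times ({} %)", index, c, freq, percent);
        }
    }
}
```

Out-of-range indices return None here; how opstr reports such a case is not covered in this post.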

Finally, I want to talk about more complex stuff:

$ opstr --op sort --locale sv-se  d ö t a
NOTE: Using default locale data shipped with this program to initialize collator to sort strings
[]string{"a", "d", "t", "ö"}

$ opstr --op sort --locale de-at  d ö t a
NOTE: Using default locale data shipped with this program to initialize collator to sort strings
[]string{"a", "d", "ö", "t"}

Sorting strings is locale-dependent. In Swedish, the ö is put at the end of the alphabet, but in Austrian German it is sorted in the middle of the alphabet, after “o”. The executable ships with all locale data for “en-us” by default. Additional locale data, generated for icu4x (the Unicode library I depend on), can be provided as a data file through the OPSTR_LOCALE_DATAFILE environment variable. I admit I was surprised myself that my locale data file for “en-US” includes the collation data for sv-se and de-at. Once I sufficiently understand which information is conventionally stored in the data file, I will write a better guide in the README file. For now, I limited the number of locale-dependent ops to a minimum to get everything right. Despite all those complexities, I still consider this an advantage over conventional tools, because their sorting is usually limited to lexicographical sorting based on Unicode codepoints (which makes little sense for natural-language text).
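To see what plain codepoint-based sorting does with the same input, here is a small Rust sketch using only the standard library. The function name codepoint_sort is made up; the sketch deliberately does not use icu4x, so it shows the limitation rather than the fix.

```rust
/// Sort strings the "conventional" way: Rust compares `str` values by their
/// UTF-8 bytes, which yields the same order as comparing codepoints.
fn codepoint_sort(words: &[&str]) -> Vec<String> {
    let mut sorted: Vec<String> = words.iter().map(|w| w.to_string()).collect();
    sorted.sort();
    sorted
}

fn main() {
    // "\u{f6}" is ö (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS).
    println!("{:?}", codepoint_sort(&["d", "\u{f6}", "t", "a"]));
    // Uppercase ASCII letters have smaller codepoints than lowercase ones.
    println!("{:?}", codepoint_sort(&["a", "Z"]));
}
```

For this input, the codepoint order happens to coincide with the Swedish result, because U+00F6 is larger than every ASCII letter. No amount of codepoint comparison yields the Austrian German order, though, and “Z” sorting before “a” shows that the result is not even alphabetical for plain ASCII.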

So let us get Unicode right when we operate on strings.

Conclusion

opstr is a neat command-line tool to apply operations to strings. As a static executable, it can be easily deployed and it already provides a multitude of operations. Unicode and regular expression support is provided, but will be extended in subsequent releases.