Invisible Markup within Fonts

Cantonese Font v2 has an intricate system of OpenType features. The strangest part is what I call the Invisible Markup.

The Problem

Cantonese (to a lesser extent Mandarin, and perhaps other Sinitic languages) breaks the common (English) expectation that writing uniquely determines meaning. In Cantonese, there are situations where the combination of written script and sound is needed to pin down a meaning. Here are some examples:

明朝
- ming4 ziu1: tomorrow morning
- ming4 ciu4: Ming dynasty
背書
- bui3 syu1: underwrite (finance)
- bui6 syu1: learn by rote
冇喎
- mou5 wo3: but it isn’t there
- mou5 wo5: I am skeptical it isn’t there
學生會好
- hok6 saang1 wui5 hou2: the students will be good
- hok6 saang1 wui2 hou2: student union is good

Frequent and irregular tone changes also mean that it is difficult to predict the sound from the written script alone. The state of the art in 2022, on normal (not intentionally tricky) prose, had an error rate of 3.7%. This means that roughly 1 in 27 assignments is wrong. This error rate caps how accurate downstream technologies, such as text-to-speech engines, can be.

Sinitic languages also do not use spaces to mark out word boundaries. How a sentence is divided into words changes the meaning significantly:

兒子|生性|病母|倍感|安慰: with the son behaving, the sick mother is much relieved
兒子|生|性病|母|倍感|安慰: the son caught STD, and mom is much relieved
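To make the point concrete, here is a minimal sketch showing that the two segmentations above collapse to exactly the same string once the boundaries are dropped. The helper name `surface` is mine, not part of any library.

```python
# Two segmentations of the same sentence, copied from the example above.
parse_a = ["兒子", "生性", "病母", "倍感", "安慰"]
parse_b = ["兒子", "生", "性病", "母", "倍感", "安慰"]


def surface(segments):
    """Join segments into the unspaced string a reader actually sees."""
    return "".join(segments)


same_surface = surface(parse_a) == surface(parse_b)
print(same_surface)        # True: identical on the page
print(parse_a == parse_b)  # False: different word boundaries
```

The written string alone cannot tell the two parses apart; the boundary information has to be carried somewhere else.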

Literary convention also asks for punctuation marks under words. Two common examples are the “underline” for proper names (places, people) and the “wavy underline” for book titles. These are not emphasis / presentational attributes: they are punctuation marks.

In other words, we accept that when we write 明朝南京市長江大橋會出席喎, and ask someone to read it out loud and explain what the sentence means, the reader will mentally

- probabilistically best-segment the sentence into its units (based on prior personal knowledge),
- assign readings to each segment,
- use the combination of segment writing and sound to assemble the overall meaning.

In most cases this would be attempted iteratively, for the different possible results. In this case, the reader may try the following parses in order:

明朝南京市長江大橋會出席喎: During the Ming dynasty, in the city of Nanjing, the Yangtze River bridge will attend… [well, bridges don’t attend things, so let’s try…]
明朝南京市長江大橋會出席喎: During the Ming dynasty, the mayor of Nanjing (Mr) Gong Daai Kiu will attend; or so I heard. [there were probably no bridges or mayors in the Ming dynasty? That is probably “tomorrow morning”]
But is it
- “Tomorrow morning the mayor of Nanjing (Mr) Gong will attend; or so I heard” or
- “Tomorrow morning the mayor of Nanjing (Mr) Gong is said to attend; but I don’t believe it”?

…and at the end the semantics is still ambiguous (and so is the pronunciation). This is, admittedly, a contrived example; what it illustrates is that a pure string of Chinese characters misses critical information.

Invisible Markup as Solution

Cantonese Font v2 (Pokfield hereafter), by not rendering certain characters / combinations, covertly builds up a markup system. The markup system is simple but addresses all of the above points and then some.

The first mark we use is the | pipe symbol as a segment marker. The font is instructed to substitute this symbol, every time it is encountered, with the zero-width non-joiner (U+200C). This character does exactly what its name says: it is zero-width (thus has no appearance), and it leaves the surrounding characters unjoined.

In the image here, the same text is rendered twice, top with Pokfield, bottom as plain-text. You can see that the pipe word-segmenters are effectively markers for machines but invisible to humans. A simple String.split(markup, "|") will return a list of

["明朝","南京市長","江大橋","會","出席","喎"]

where the segments, and in this case the meaning, are pinned down further.¹
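For readers not using Elixir, the same split can be sketched in Python. The variable names are illustrative only; the markup string is the running example with its pipe markers.

```python
# Dry markup for the running example, with | as the segment marker.
dry = "明朝|南京市長|江大橋|會|出席|喎"

# What a machine recovers: the explicit segments.
segments = dry.split("|")

# What a human sees when the font hides the pipes: the plain sentence.
visible = "".join(segments)

print(segments)  # ['明朝', '南京市長', '江大橋', '會', '出席', '喎']
print(visible)   # 明朝南京市長江大橋會出席喎
```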

The second mark is the . dot/period symbol. Whenever a dot occurs, it must be followed by a Jyutping (Cantonese romanization) syllable, and this specifies the pronunciation. We illustrate this by further modifying our example from above:

Note how the glyphs for 朝 and 喎 have taken the user override into account.
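A parser can pick these overrides out with a short regular expression. The sketch below rests on two assumptions of mine that the text does not spell out: the dot comes directly after the character it overrides, and a Jyutping syllable is lowercase letters followed by a tone digit from 1 to 6. The markup string is a hypothetical reconstruction, since the original example is an image.

```python
import re

# Hypothetical dry markup (assumed shape: character, dot, Jyutping).
dry = "明朝.ziu1|南京市長|江大橋|會|出席|喎.wo5"

# One character, a literal dot, then letters plus a tone digit 1-6.
OVERRIDE = re.compile(r"(.)\.([a-z]+[1-6])")


def overrides(markup):
    """Return (character, jyutping) pairs for every explicit override."""
    return OVERRIDE.findall(markup)


def strip_overrides(markup):
    """Drop the '.jyutping' parts, keeping characters and segment markers."""
    return OVERRIDE.sub(r"\1", markup)


print(overrides(dry))        # [('朝', 'ziu1'), ('喎', 'wo5')]
print(strip_overrides(dry))  # 明朝|南京市長|江大橋|會|出席|喎
```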

The third set of marks is a [tagging] system, where the character(s) to be tagged is enclosed in square brackets [ ], and it can be tagged with at most 2 single-character tags and a 0-99 number. Again let’s look at extending our example:

We have provided an ID for 明朝 (perhaps to provide the exact date in a footnote) and 江大橋, and flagged 南京 and 江大橋 as proper names so that the underline name punctuation can be applied in typesetting systems capable of doing so. (The word boundaries also provide a way for typesetting systems to perform line breaks; see my previous article on Typesetting Jyutping-annotated Chinese Text.) The name of the mayor has additionally received an emphasis instruction.
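Here is a rough sketch of how a parser might read these tags back. The exact syntax is an assumption on my part: I take the form to be `[text tags number]`, with up to two adjacent tag letters and an optional 0-99 number, and I borrow the tag letters n, b, e, x from the Implementation section. The markup string is a hypothetical reconstruction of the example, and `parse_tags` is an illustrative name.

```python
import re

# Assumed form: [text], optional " tags" (1-2 letters), optional " number".
TAGGED = re.compile(r"\[([^\[\]\s]+)(?: ([nbex]{1,2}))?(?: (\d{1,2}))?\]")
TAG_NAMES = {"n": "name", "b": "book", "e": "emphasis", "x": "strikethrough"}


def parse_tags(markup):
    """Return one dict per tagged span: its text, tag names, and optional ID."""
    results = []
    for text, tags, number in TAGGED.findall(markup):
        if not tags and not number:
            continue  # bare brackets carry no tag information
        results.append({
            "text": text,
            "tags": [TAG_NAMES[t] for t in tags],
            "id": int(number) if number else None,
        })
    return results


# Hypothetical dry markup for the running example.
dry = "[明朝 1]|[南京 n]市長|[江大橋 ne 2]|會|出席|喎"
for entry in parse_tags(dry):
    print(entry)
```

This sketch does not handle numbers placed before tags; a fuller parser would need one more alternative for that ordering.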

Look at what we have done here: the (still quite humanly readable) “dry” markup has uniquely annotated pronunciation and semantics that is machine-readable, but has not interfered at all with the user-output.

The astute reader would say, but the plain-text version has not uniquely annotated the pronunciation for all characters; we only explicitly labelled two of them. What we need to recall is that the font was prepared / driven from some software, and the same software is thus capable of “hydrating” this into a string where every sound is made explicit:

“But Jon! Nothing changed in the top line! And the bottom line looks like gibberish!” That’s exactly the point. The Hydrated form is a very information-dense markup that specifies segments, sounds, and semantics, and supports tagging; it can be parsed by any programming language (without needing access to ExCantonese/Elixir); it can be used to drive LaTeX/Typst typesetting and HTML display; it is a much better starting point for, say, training text-to-speech systems; and this richness is generated largely as a side effect of what users would do to make their text render correctly anyway.
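To give a feel for what hydration involves, here is a toy sketch. It is not the ExCantonese implementation. The output format (every character followed by a dot and its Jyutping) is my assumption, and the tiny lexicon holds only default readings taken from the examples earlier in this post; real software would pick readings by context.

```python
import re

# Toy default readings, taken from the examples in The Problem section.
DEFAULT_READING = {
    "明": "ming4", "朝": "ciu4", "背": "bui3",
    "書": "syu1", "冇": "mou5", "喎": "wo3",
}

# Either a segment marker, or a character with an optional '.jyutping'.
TOKEN = re.compile(r"(\|)|(.)(?:\.([a-z]+[1-6]))?")


def hydrate(dry):
    """Make every reading explicit; user overrides win over defaults."""
    out = []
    for bar, char, override in TOKEN.findall(dry):
        if bar:
            out.append("|")
        else:
            # Raises KeyError for characters missing from the toy lexicon.
            out.append(f"{char}.{override or DEFAULT_READING[char]}")
    return "".join(out)


print(hydrate("明朝.ziu1|背書|冇喎.wo5"))
# 明.ming4朝.ziu1|背.bui3書.syu1|冇.mou5喎.wo5
```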

—-

Why is this important? Thorough documentation and a community repository are integral parts of the Pokfield release. With the community repository, I encourage submissions to be first and foremost plain text (in Dry markup), with rendered forms secondary (perhaps we set up a CI/CD-like (continuous integration/delivery) system that watches the repo forum and typesets / compiles submissions to rich PDF). Cantonese is a language with 85,000,000 users, but it is nonetheless considered low-resource due to the lack of annotated written material. Over time, while users use Pokfield for their own ends, the community accumulates a body of annotated Cantonese texts with permissive licensing.

Why is this interesting? AFAIK both the concept of embedding an invisible markup language, and the technique to do so, are new. I am almost certain that this isn’t quite the right final form. We need to think through, and try to minimize, conflicts with other commonly used markup systems; but at least we know this is technologically viable — and may stimulate some of you to extend upon it in the years to come.

Implementation

The characters to be made invisible are placed inside a one-to-one Lookup:

lookup INVISIBLE_SYMBOL {
  sub bar by uni200C; # replaces | with zwnj (U+200C)
  ...
  sub bracketleft by uni200C; # yes, just [ is "standalone" invisible
} INVISIBLE_SYMBOL;
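The effect of this lookup on a string can be imitated in a few lines of Python. This is only a simulation of the substitution, not how a shaping engine works internally, and it covers only the two rules shown above; the elided rules are not guessed at.

```python
import unicodedata

# Look the character up by its Unicode name rather than hard-coding it.
ZWNJ = unicodedata.lookup("ZERO WIDTH NON-JOINER")

# Mirrors the two substitutions visible in the lookup above.
INVISIBLE = str.maketrans({"|": ZWNJ, "[": ZWNJ})


def shaped(text):
    """Imitate the one-to-one substitution the font applies."""
    return text.translate(INVISIBLE)


sample = "明朝|南京市長"
out = shaped(sample)
print(len(sample) == len(out))  # True: one-to-one, so the length is unchanged
print("|" in out)               # False: the pipe is gone from the output
```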

The closing tag is… a bit odd.

@tag = [n b e x]; # name, book, emphasis, strikethrough
@digit = [one two three four five six seven eight nine zero];

lookup CLOSE_TAG {
  sub space @tag space @digit bracketright by uni200C; # replaces " n 1]"; the zwnj target is assumed, mirroring INVISIBLE_SYMBOL
  ...
} CLOSE_TAG;

Note that the above line supports one single-character tag and a single-digit number. More lines are needed to cover the cases with two tags and a single digit, two tags with two digits, two tags and no digits, digits before tags, and so on.
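Writing these variants by hand gets tedious, so here is a sketch of generating them. It produces only the left-hand glyph sequences and leaves the substitution target to the caller. It assumes tags come before digits and that two tags sit next to each other, which may not match the real font; `close_tag_patterns` is an illustrative name.

```python
from itertools import product

TAG_COUNT = 4     # n, b, e, x
DIGIT_COUNT = 10  # zero through nine


def close_tag_patterns(max_tags=2, max_digits=2):
    """Yield (glyph sequence, number of rules after class expansion)."""
    for n_tags, n_digits in product(range(1, max_tags + 1),
                                    range(max_digits + 1)):
        seq = ["space"] + ["@tag"] * n_tags
        if n_digits:
            seq += ["space"] + ["@digit"] * n_digits
        seq.append("bracketright")
        expanded = TAG_COUNT ** n_tags * DIGIT_COUNT ** n_digits
        yield " ".join(seq), expanded


patterns = list(close_tag_patterns())
for sequence, expanded in patterns:
    print(f"{expanded:>5}  {sequence}")
print(sum(expanded for _, expanded in patterns))
```

Each class-based line looks small, but the second column shows how quickly the count grows once the classes are expanded.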

Before you naively do what I first tried to do, which was a glorious general system for writing html-like attributes with up to 16 characters:

…you need to know that, during compile time, the groups are expanded and every rule enumerated explicitly. With @alphanumeric containing 26 + 26 + 10 + 2 = 64 members, that last line alone would resolve to 64¹⁶ = 7.92 x 10²⁸ rules. When your fonts involve 131.6 kilo-mol of rules you know you’ve messed up.
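The arithmetic is easy to check. The sketch below recomputes the class size, the number of expanded rules for 16 positions, and the conversion to kilomoles using the Avogadro constant.

```python
AVOGADRO = 6.02214076e23  # entities per mole

members = 26 + 26 + 10 + 2          # size of @alphanumeric
rules = members ** 16               # one choice per position, 16 positions
kilomoles = rules / AVOGADRO / 1000

print(members)               # 64
print(f"{rules:.3g}")        # 7.92e+28
print(f"{kilomoles:.1f}")    # 131.6
```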

In practice there is a further limit of 64k characters per lookup, and this is applied after expansion. This places an upper bound of about 3,000 lines on what the tagging system can support. Given that I’m allocating 100 numbers (0-99), the square of the number of tags must be at most 30. Strictly, that allows 5 tags (5² = 25); we *might* be able to get away with 6 (6² = 36), and more if we impose more structure on how they are ordered and combined.²
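As a back-of-the-envelope check of that budget, the sketch below treats "about 3,000 lines" as a hard limit, which it is not, and counts only the two-tag, full-number case.

```python
import math

LINE_BUDGET = 3000  # "about 3,000 lines", treated here as exact
ID_VALUES = 100     # numbers 0-99

pair_budget = LINE_BUDGET // ID_VALUES  # room left for tag pairs
max_tags = math.isqrt(pair_budget)      # largest t with t * t <= pair_budget

print(pair_budget)  # 30
print(max_tags)     # 5
for tags in (5, 6):
    print(tags, tags ** 2 * ID_VALUES)  # 5 -> 2500 lines, 6 -> 3600 lines
```

Six tags overshoot the nominal budget by about a fifth, which is why it only works if the real limit has some slack or the combinations are restricted.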

¹ There is an associated marker, \ backslash, that does the same replacement but that parsers should simply discard. I have an intuition that, theoretically, there are cases where we want to break up a ligature while still agreeing with the font that it is a word. I can’t think of an example yet.
² If we really, really want, we can programmatically generate these lookups. For example, if we were to allow all lowercase characters as tags, we could specify in each lookup [CJK xx n], keeping n within a range of 4 or so (26 x 26 = 676). The theoretical upper bound, with useExtension in place, is 2³² = 4.3 billion bytes, which would be something like 400 million rules. The practical upper bounds I’ve seen are imposed by font renderers, of which Adobe products are the most bewildering. Complex fonts fail first in Adobe products, but they fail differently in different products (even ones that we end users consider similar, like Illustrator and InDesign), and again differently for the same product on different OSes (e.g., fonts with >20,000 color SVGs fail to render in Mac/Illustrator, but render fine in Win/Illustrator). To add insult to injury, I don’t think I’ve ever had any of my reported bugs fixed.

jon.hk