Typesetting jyutping parallel texts

Analysing challenges and solutions for typesetting parallel jyutping-annotated Chinese / English texts. A post-mortem after completing the Gospel of Matthew, which at 33,000+ characters is the longest jyutping work and the only parallel Zh-jyutping / En text.

Jyutping⁺ makes pronouncing Cantonese easy with tone-marks and subtle color effects. Visit here for jyutping+ material including long-form text and audio eBooks. I also made the Cantonese Font to magically annotate your text with Jyutping⁺ on desktop machines, in offline apps that you already use (documents, slides) and online pages (Spotify lyrics, Wikipedia/Wikisource).

粵拼⁺ 使用色彩配合符號簡潔表示聲調，令學習廣東話更容易。 在這裏你可以找到使用粵拼⁺標示嘅書刊、包括有語音嘅電子書。我用一年時間研發新技術，煉咗套「 粵語字體」，佢可以幫你文件、教材裏面嘅中文字自動加上準確粵拼⁺，亦都可以用嚟瀏覽網頁(例如Spotify 歌詞、維基百科)。

Motivations

Typesetting refers to the way text is placed on a page. Setting irregular, richly annotated parallel Chinese+jyutping / English text is the trickiest typesetting I have done. Here are a few sample pages from the finished project, a parallel Gospel of Matthew designed for landscape A4 print. The last image shows the same technique applied to an A4 portrait page (the initial attempt).

Chinese poses a unique difficulty for learners with a Latin script background, as the writing conveys ideas (ideographs) and is generally detached from the sounds. Romanization is the process where a Latin-script pronunciation is attached to an ideograph to indicate the sounds. For Mandarin, this is standardized as pinyin (拼音); Cantonese has two modern standards, Yale and Jyutping (粵拼), that have the same expressiveness but use different annotation formats.

The major difference between Yale and Jyutping is in how they designate tones. Yale distinguishes high-low using the character h as a marker and inflections with diacritics (symbols above vowels, like ā); jyutping uses the numerals 1-6. Tones are essential to the meaning of an utterance: no can tone, no Cantonese. Jyutping is generally easier to type (saam1) but its interpretation requires being familiar with just what 1 means.

I have dabbled in teaching Cantonese since 2016, and among adult and teen learners, only the high-intent ones pick up jyutping; being a native speaker doesn’t generally help with jyutping fluency.¹ And then, in long prose, it just looks intimidating and alien.

From about 2017 I have, for my own teaching notes, applied some design touches to make speaking the jyutping more intuitive. One of the key insights is to provide a visual hint (tone mark) that shows both the high-low placement and the inflection, and to place the tone number vertically relative to the other tones. Judicious choice of colors and scale makes the jyutping very readable but not “loud”, while still progressing directly to “normal jyutping”. This variation is what I call Jyutping+.

For educational multimedia, placing conceptually related items in close proximity is very helpful. The goal, then, is to establish a method that lets us co-locate

1. Chinese ideographs
2. Jyutping+ romanization
3. English translation, and
4. Spoken audio

The project includes slide-deck versions that have spoken audio, but that is the subject for some other day. This article is about how to prepare manuscripts that co-locate (1)–(3); specifically, it outlines the difficulties in actually co-locating them, and the design decisions that resolve the issues. There are general solutions, but this article will not be a technical LaTeX treatise.

Challenges

The first challenge is how to assign the jyutping(+). Cantonese has extensive tone changes for many characters (sandhi), some characters have 6 not-unusual readings, and assigning jyutping is time-consuming and requires some experience.

The panel shows some typical examples: one character is duplicated in each image. Note how the context drives the sound and, in some cases, how the sound drives the meaning.

Technology can help, but not so much. Standard automation using the PyCantonese library carries an error rate of about 4% (that is, 1 in 25 characters is wrong), which makes long texts extremely tedious. The longest text prior to this, completed by Chaak and his merry band of Words.hk editors, was The Little Prince (~20,000 characters). Matthew is over 10,000 characters longer, and it was done solo within 8 weeks.

A companion article (soon!) describes how I improved this process. That process needs to solve not only the assignment problems, but also the demands created by putting text on pages.² Markup and typesetting intertwine, but for now, we accept that all the basic information is there and correct; we just don’t know how to place it on paper.

1. Jyutping+ is not text

Word processors (at least in their East Asian editions) have features that allow text to be placed over ideographs; this is called ruby-text or furigana (振（ふ）り仮名（がな）). They are generally ugly to look at, and even if we accept ugliness, they can’t do jyutping+.

See, jyutping+ asks for:

4 colors,
an irregular symbol, and
a tone number that is placed at a fractional vertical offset

The irregular symbols (tone marks) are drawn as vector graphics (PDF/SVG), and their bounding boxes are irregular. Some tones tend to be read a little longer and this is reflected in the mark. The tone numbers are neither plain superscripts nor subscripts; each of the six floats at a different level.

Last row is the best MS Word can do. Notice the (1) vertical compression of Latin glyphs, (2) rigid fit over characters, and (3) plain text input.

Non-text content in furigana really isn’t handled very well with most commercial software, and in any case, scaling these performantly to 30,000+ jyutping+ is an issue.

Solution

Your friendly neighbourhood LaTeX evangelist: Use LaTeX 🤗 🥳

Specifically, what we need here is defining a custom macro jyutping/4:

% \jyutping{initial}{nucleus}{coda}{tone}

% example
\jyutping{s}{aa}{m}{1}

The jyutping macro is simple, formatting each part of the string. Conditional blocks (\if \fi) are used for the tone. We can then wrap jyutping/4 into the ruby macro:

% \ruby{text}{furigana}

\ruby{三}{\jyutping{s}{aa}{m}{1}}
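For readers who want something to experiment with, here is a minimal sketch of what a jyutping/4 macro could look like. It is not the project's actual definition: the colour names, the colour values, the helper \jptone and the vertical offsets are all assumptions made for illustration, and the vector tone mark is left out entirely.

```latex
\documentclass{article}
\usepackage{xcolor}

% Assumed palette of four colours; the article does not give the real values.
\colorlet{jpInitial}{teal!70!black}
\colorlet{jpNucleus}{black}
\colorlet{jpCoda}{blue!60!black}
\colorlet{jpTone}{red!70!black}

% Hypothetical helper: print the tone digit raised by a tone-dependent amount.
% The offsets are illustrative only; each of the six tones gets its own level.
\newcommand{\jptone}[1]{%
  \ifcase#1\relax                 % 0: print nothing
  \or \raisebox{1.0ex}{\tiny 1}%
  \or \raisebox{0.8ex}{\tiny 2}%
  \or \raisebox{0.5ex}{\tiny 3}%
  \or \raisebox{-0.2ex}{\tiny 4}%
  \or \raisebox{0.3ex}{\tiny 5}%
  \or \raisebox{0.1ex}{\tiny 6}%
  \fi}

% \jyutping{initial}{nucleus}{coda}{tone}; initial and coda may be empty,
% but the tone must be a number.
\newcommand{\jyutping}[4]{%
  \mbox{\textcolor{jpInitial}{#1}\textcolor{jpNucleus}{#2}%
        \textcolor{jpCoda}{#3}\textcolor{jpTone}{\jptone{#4}}}}

\begin{document}
\jyutping{s}{aa}{m}{1} \jyutping{}{aa}{}{3} \jyutping{ng}{aa}{ng}{6}
\end{document}
```

Splitting the syllable into four arguments is what makes per-part styling possible; a single string argument would need to be parsed inside TeX, which is far less pleasant.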

How to do this in Word

Some of you just need to do this in Word, and here’s how you add furigana. Note that the interface and terminology differ across versions of Word and operating systems, so your mileage might vary.

1. Select your text, and choose Format / Asian Layout / Phonetic Guide.

2. Annotate the Chinese text with furigana in the Ruby Text section, then choose the furigana font and alignments. If you move text around, you may find the OK grayed out; Word 2023 / Mac expects every row in the Ruby Text section to be filled. Indulge the software and give it one space for the undeletable rows, and then delete these unnecessary spaces.

Caveat: due to a bug, this works really poorly. When you want to edit a previously saved annotation, Word will automatically split the Chinese text back into one character per row, and dump the ruby text for multiple characters into one character.

2. Jyutping+ is long. Or short.

The long ones are problematic. Let’s look at how glyphs are generally designed:

English fonts comprise a great variety, but generally the width of one Zh glyph approximates 2 En glyphs: the following two items should look similar in width on your screen: 唔 m4.

Chinese fonts are built with glyphs of uniform width (they must be; your eye can feel something is off even if glyph widths vary by ~5%). But jyutping can be short (m4) or long (saam1) or very long (ngaang6). Not even in the shortest case, m4, can jyutping fit over a glyph at 100%.

This creates a dilemma:

If we fit the average jyutping over a glyph, then adjacent long ones overlap in their jyutping.
If we fit the very long ones over a single character (a decision I made for the Canto Font v1) then the jyutping has to be quite small (about 20% of the font size). You can vertically compress Latin text to squeeze out a little more (can get to 30%), but more distortion is more ugliness too.

Solution

These two issues actually point to two solutions:

1. We should allow jyutping+ to conditionally spill over to its neighbours (A -> B).

For example, if we have the fragment 唔硬 (m4 ngaang6), we should encourage the long jyutping to spill over into the short one; and long jyutping at the ends of a line should be encouraged to spill out into the gaps.

2. We can increase the spacing between characters, so the jyutping has more space to fill (B -> C).

There is a trade-off; text becomes less contiguous and less readable at large separations.

The final project blends these two solutions, and adds two rules of thumb. The third rule states that jyutping should preferably stay over a word (multiple characters), and spill over to a neighbouring word only when absolutely necessary. This creates a subtle effect where the spacing between Chinese characters is uniform, but there appears to be word segmentation within the jyutping.

The last rule of thumb states that the jyutping is allowed to spill over into the margins at the beginning and end of the line.

With all of these in place, the jyutping+ can increase in font-size to about 50% that of the Chinese glyphs. Subjectively there is a big jump in legibility for long prose, around 40%.
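The two basic ideas (overhang and wider spacing) can be sketched in a few lines. This is not the project's \tripleruby: the names \overhangruby, \rubybasebox and \rubygap, the font, and the sizes are assumptions, and the word-level rules of thumb are not implemented.

```latex
% Compile with XeLaTeX. A sketch under stated assumptions only.
\documentclass{article}
\usepackage{xeCJK}
\setCJKmainfont{Noto Serif CJK TC} % assumption: any installed CJK font will do

% Solution 2: extra space after every annotated unit (value illustrative).
\newlength{\rubygap}
\setlength{\rubygap}{0.4em}

% Solution 1: the annotation sits in a box exactly as wide as the base text;
% \hss on both sides lets a long annotation overhang its neighbours equally.
\newsavebox{\rubybasebox}
\newcommand{\overhangruby}[2]{%
  \leavevmode
  \sbox{\rubybasebox}{#1}%
  \vbox{\offinterlineskip
    \hbox to \wd\rubybasebox{\hss{\tiny #2}\hss}%
    \kern 0.3ex
    \box\rubybasebox}%
  \hskip\rubygap\relax}

\begin{document}
% The long ngaang6 spills towards the short m4 instead of forcing a wider box.
\overhangruby{唔}{m4}\overhangruby{硬}{ngaang6}
\end{document}
```

Because the outer box keeps the width of the base text, the Chinese glyphs stay evenly spaced no matter how long the annotation is; whether two neighbouring annotations actually collide is then governed by \rubygap and the annotation size, which is exactly the trade-off described above.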

3. Chinese text has no word boundaries

Unlike English, where words are separated by spaces, Chinese sentences consist of contiguous characters, and words are not marked by boundaries. Automated layout simply breaks off a new line whenever there are enough characters to fill the current one. This w-

-ould not just be ugly, like a sent-

-ence in English that breaks up wo-

-rds in awkward places, but in Chinese, can also change the meaning of the prose.

Consider the following examples (the second of which is vulgar):

In case A1, 南京.市長.江大橋 means “Mr GONG Daai Kiu, the mayor of Nan-jing”, whereas 南京市.長江大橋 in A2 means “Bridge over Yangtze River in Nan-jing”.

In B1, 忍者龜.頭.很大 means “teenage mutant ninja turtles have big heads”; B2 忍者.龜頭.很大 “ninjas have large dick heads”; B3 忍.者龜頭.很大 “Nin-” “-ja [ambiguous interpretation]”.

Line breaks can thus influence how the reader interprets the prose. All of these readings are plausible, and the editor needs to decide what the right cut is. The general, aspirational solution is that every group of characters that hangs together as a semantic unit ought to be annotated as such in the mark-up.

4. Handling proper nouns

The nature of working with the Bible is that there are many, many proper nouns: groups of characters that form a word only in the context of the Bible. Take the beginning verses in Chapter 1 of Matthew, which describe Jesus’ ancestral tree:

These are all proper nouns that must fulfil two constraints:

they must not be broken by a line break, and
they must be underlined. In Chinese, the convention is to underline personal names and geographical locations, and to use a wavy underline for book titles.

Solution to 3 and 4

The solution here is an undocumented feature of \ruby, where the object is handled with no line breaks. (This is not so obvious given ruby texts are allowed to interact with other objects.) Given this, if we use ruby/2 to wrap not individual characters but groups of characters that correspond to a word, then line breaks are correct.

Within ruby/2, the first argument now fully contains the proper noun; we can then use the built-in feature from the xeCJK package to underline (with the specific spacing that CJK text requires).
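Stripped of the jyutping layer, the two constraints reduce to "keep each unit in an unbreakable box" and "underline the proper nouns". The sketch below shows that idea with \mbox standing in for the ruby object; the helper names \wordunit, \propername and \booktitle and the font are assumptions, while \CJKunderline and \CJKunderwave come from xeCJKfntef.

```latex
% Compile with XeLaTeX. A sketch, not the project's macros.
\documentclass{article}
\usepackage{xeCJK}
\usepackage{xeCJKfntef}            % provides \CJKunderline and \CJKunderwave
\setCJKmainfont{Noto Serif CJK TC} % assumption: any installed CJK font will do

% Hypothetical helpers. \mbox keeps a unit on one line; \allowbreak then
% offers a legal break point between units, and nowhere else.
\newcommand{\wordunit}[1]{\mbox{#1}\allowbreak}
\newcommand{\propername}[1]{\mbox{\CJKunderline{#1}}\allowbreak}
\newcommand{\booktitle}[1]{\mbox{\CJKunderwave{#1}}\allowbreak}

\begin{document}
% People and places: straight underline, never split across lines.
\propername{猶大}\wordunit{從}\propername{他瑪}\wordunit{氏}\wordunit{生}%
\propername{法勒斯}\wordunit{和}\propername{謝拉}

% The segmentation chosen by the editor decides where a break may fall.
\wordunit{南京市}\wordunit{長江大橋}
\end{document}
```

With the ruby annotations and spacing corrections restored, the real markup is considerably denser.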

By now this is looking pretty gnarly:

\tripleruby{\CJKunderline{猶大}{ }}{\jyutping{j}{a}{u}{4}\jyutping{d}{aa}{i}{6}}{} 
\tripleruby{從}{\jyutping{c}{u}{ng}{4}}{} \tripleruby{\CJKunderline{他瑪}氏{ }}{\jyutping{t}{aa}{}{1}\jyutping{m}{aa}{}{5}\jyutping{s}{i}{}{6}}{} 
\tripleruby{生}{\jyutping{s}{a}{ng}{1}}{} \tripleruby{\CJKunderline{法勒斯}{ }}{\jyutping{f}{aa}{t}{3}\jyutping{l}{aa}{k}{6}\jyutping{s}{i}{}{1}}{} 
\tripleruby{和}{\jyutping{w}{o}{}{4}}{} \tripleruby{\CJKunderline{謝拉}{ }}{\jyutping{z}{e}{}{6}\jyutping{l}{aa}{i}{1}}{} •
\tripleruby{\CJKunderline{法勒斯}{ }}{\jyutping{f}{aa}{t}{3}\jyutping{l}{aa}{k}{6}\jyutping{s}{i}{}{1}}{} 
\tripleruby{生}{\jyutping{s}{a}{ng}{1}}{} \tripleruby{\CJKunderline{希斯崙}{ }}{\jyutping{h}{e}{i}{1}\jyutping{s}{i}{}{1}\jyutping{l}{eo}{n}{4}}{} •
\tripleruby{\CJKunderline{希斯崙}{ }}{\jyutping{h}{e}{i}{1}\jyutping{s}{i}{}{1}\jyutping{l}{eo}{n}{4}}{} \tripleruby{生}{\jyutping{s}{a}{ng}{1}}{} 
\tripleruby{\CJKunderline{亞蘭}{ }}{\jyutping{}{aa}{}{3}\jyutping{l}{aa}{n}{4}}{} •\\

And this is only one verse of the jyutping+ annotated Chinese text 😭

Astute readers will notice that I have swapped in tripleruby/3 where there used to be ruby/2. This is a custom macro where the last argument is furigana that is center-aligned under the text, and it also hides some major shenanigans to correct for spacing.³

The even more inquisitive readers are wondering where these “groups of characters that correspond to a word” come from. The answer is that I perform word segmentation (分詞) along with the jyutping assignment, and augment that with a custom dictionary of all of the Bible words.⁴ The segmentation is 95% right for proper nouns, and the rest are acceptable.⁵ More on this in the companion article.

And finally, we have the Chinese-jyutping+ sorted.

5. Two languages need to be parallel

The Chinese / jyutping+ text needs to be set in one column, and the English text is set parallel in another column. The relative length of the Chinese and English text varies: some verses of Chinese text are much longer than the English, and vice versa in others. This is specific to the translation, relative font-size, and inter-CJK spacing, and thus would change on a project-by-project basis. (That is probably true for everything here: the methods make the impossible possible, but don’t exactly make it easy or automatic.)

A surprising variable here is the length of proper nouns. Given that proper nouns must not be broken across lines, we need to allocate suitable column spacing for them to (occasionally) spill over:

While the verse structure makes referencing a particular paragraph easy, readers (perhaps in a reading group) who want to refer to a particular line need to be able to talk about “character 3 in line 40,” but line 40 is different in Chinese and English.

This means regular tooling no longer works, and we need to bring out reledpar. Reledpar / reledmac are the state-of-the-art LaTeX packages for handling critical editions, and all the complexities around setting parallel editions are mostly solved, including language-specific footnotes. All that is needed is endless tinkering.
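As a starting point for that tinkering, the basic reledpar skeleton looks roughly like this. It is a minimal sketch, assuming one \pstart … \pend pair per verse so that the two columns stay synchronized; the column widths, the font and the English wording are placeholders rather than the project's settings.

```latex
% Compile with XeLaTeX. Widths, font and sample text are placeholders.
\documentclass{article}
\usepackage[a4paper,landscape]{geometry}
\usepackage{xeCJK}
\setCJKmainfont{Noto Serif CJK TC} % assumption: any installed CJK font will do
\usepackage{reledmac}
\usepackage{reledpar}

% Unequal columns leave the English side some room for spill-over.
\setlength{\Lcolwidth}{0.42\textwidth}
\setlength{\Rcolwidth}{0.50\textwidth}

\begin{document}
\begin{pairs}
  \begin{Leftside}
    \beginnumbering
    \pstart 猶大從他瑪氏生法勒斯和謝拉 \pend
    \endnumbering
  \end{Leftside}
  \begin{Rightside}
    \beginnumbering
    \pstart Judah became the father of Perez and Zerah by Tamar \pend
    \endnumbering
  \end{Rightside}
\end{pairs}
\Columns % typesets the two sides accumulated above in parallel columns
\end{document}
```

Each side keeps its own line numbering, which is what lets a reading group cite a line in either language independently.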

(The eagle-eyed reader will see that the right column line-numbers sit in the English spill-over zone. Won’t they clash? Yes they would. Part of the tinkering is to make sure that there is no overlap across the ~3,300 lines.)

6. 神 and 上帝

Chinese Bible translations have an additional complication: when the first localization happened some 120 years ago, Protestant translators could not agree on whether “God” should be translated as 神 (god) or 上帝 (supreme king).⁶ They agreed to disagree, and made translations where all else is the same except for the token used to designate “God/the LORD”. This schism remains today, with denominations using one but not the other. This changes the spacing of the typesetting wholesale, and used to be a big, big problem. Most printed Chinese bibles, in fact, contain a space ⬚ before 神: the typesetter does the 上帝 edition, then by replacing 上帝 (two glyphs) with ⬚神 (two glyphs), nothing else needs to be changed.

Solution

Instead of writing this in our markup:

% before
\tripleruby{神}{\jyutping{s}{a}{n}{4}}{}

% after
\God{}

We can abstract references to the Lord out into a special macro God/0. This means that in the preamble I can simply specify this is a 上帝 edition, and every reference to God/0 is substituted and every line gets re-calculated.
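A minimal sketch of such a switch, assuming a boolean named \ifShangdiEdition (a hypothetical name) and stub definitions of \jyutping and \tripleruby so that the fragment stands on its own; the real macros do far more.

```latex
% Preamble fragment. The stubs only apply if the real macros are not loaded.
\providecommand{\jyutping}[4]{#1#2#3#4}
\providecommand{\tripleruby}[3]{#1}

% Hypothetical edition switch: set once, affects every \God{} in the text.
\newif\ifShangdiEdition
\ShangdiEditiontrue   % use \ShangdiEditionfalse for the 神 edition

\newcommand{\God}{%
  \ifShangdiEdition
    \tripleruby{上帝}{\jyutping{s}{oe}{ng}{6}\jyutping{d}{a}{i}{3}}{}%
  \else
    \tripleruby{神}{\jyutping{s}{a}{n}{4}}{}%
  \fi}
```

Unlike the printer's trick of padding 神 with a space, nothing here has to keep the two editions the same width: flipping the switch and recompiling re-flows every affected line.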

Wrap up

The process

The typesetting runs themselves take about 5 minutes on my M1 MacBook Pro, which is entirely acceptable. This can probably be reduced by baking each individual jyutping (maybe 2000?) into a vector image, then placing the image instead of the text (so one fixed bounding box per jyutping instead of >4 glyphs + 1 vector image). I have fidgeted endlessly on the jyutping appearance (the slides and the landscape PDF each have their own optimized jyutping!), so it might not have been worth the re-factoring.

In terms of the development, from text files to PDF, the jyutping assignment + typesetting took about 6-8 weeks; it was another month to record and arrange the audio / slides. The development is fast, and in very few places did I feel stuck (mostly undocumented behaviour around CJK in LaTeX). I attribute that to being in full command of the Elixir–LaTeX stack. Lots of grunt work but at just 1,100 or so verses, it’s acceptable. LaTeX makes this whole process fidgety but at least possible. I marvel all the time how this 1979-originated ‘legacy’ software can be stretched so, so far; and per the Lindy Effect, this workflow will probably still work in 2048.

The good

It was always going to be exceptional in quality (I’m fastidious & obsessive), but the system ends up marvellously flexible too. We control every aspect at every granularity: from jyutping -> character -> word -> paragraphs -> page -> chapter, but unlike a hand-chiseled Illustrator/InDesign job, it is possible to work at scale, making changes across hundreds of pages easily.

For a demo of the raw, unadjusted output, you can download this PDF:

jyutping_typeset_demo_output Download

Are you working on projects involving jyutping(+), demanding typesetting, or complex data transformations? Let’s talk. Get things done right, and skip the open-ended uncertainties.

如果你有項目需要粵拼⁺ , 技術性質的排版, 以及困難、大量的資料處理，可以聯絡我討論 / 報價。

The bad

What I don’t like is that the jyutping processing is currently a one-way flow, from text -> word segments -> (elixir structs) -> LaTeX markup -> parallel column markup. If this becomes something I do repeatedly, I would build it differently. There are two paths forward:

a limited LaTeX markup parser to bring feedback/corrections back upstream, or
a Phoenix LiveView UI that makes the jyutping corrections, underlining, parallel tagging etc happen in-memory

There would be two advantages: first, the feedback/corrections can re-flow to other output formats, and second, the whole machinery learns with each project and gets better over time (this is particularly important w.r.t. v2 of the Canto Font). Three of the re-flows I have in mind are:

HTML. This would make for a web version, as well as ePub/mobi eBooks;
Typst. Typst is a late-2010s re-thinking of LaTeX, written in Rust. It doesn’t have the ecology to fulfill my needs yet (as of Sep 2023 Typst doesn’t even have a package manager!), but for what it can do, it’s fast. LaTeX is slow, has a min 10 sec spin up time for jyutping+, and doesn’t work in parallel (I don’t know why). A single file takes 30 sec in LaTeX, but 0.030 seconds in Typst. Orchestrated through the BEAM, one hundred separate files take almost an hour in LaTeX, but 0.375 sec in Typst (!!).
Audio. By stitching mp3s of individual jyutping, this makes it possible to correct by listening. This is quite important for jyutping that is parallel to audio; correcting jyutpings that are correct-but-not-optimal, like 聽 as teng1 / ting1 / ting3, is otherwise quite tricky.

The difficulty would be [very limited parser] < [LiveView UI / data structure] < [parser]. When I have a totally free month…

(Speaking of the audio. Being in Hong Kong, I have nowhere that is free from outside noise. I rented a somewhat quiet multi-purpose room for the 25 hours of recording, but that still introduced some noise, which post-processing reduced into a wobble. Cantonese-jyutping works have no market, and I can’t justify 25hr x HK$350/hr of recording studio expense.)

Footnotes

1. I have probably read as much jyutping as anyone, but I am still not fluent in the sense that I can see a plain jyutping and read it aloud. Jyutping+, yes.
2. It’s Elixir code, including two open-source libraries built just for this.
3. We want there to be extended spacing between Chinese characters (to accommodate jyutping+), so we specify special behaviour for CJK glyphs (Zh-Zh). This, however, creates several cases of spacing to be handled: (En-Zh, Zh-En), along with implicit En-En. This is complicated by ruby-text being objects with mixed Zh and En…
4. All the Bible proper nouns?? Yes. Luckily, I held onto a little pamphlet from the early 2000s (no longer published?), which maps Catholic Chinese terms to the Protestant terms, along with their English, Greek, and Hebrew equivalents. This was digitized and used as the source.
5. What is a word seems simple but is messy. Consider []; should it be [].[] , or all four characters as one unit? Or any negations, such as []. Note that English “not necessary” and “unnecessary” could be one or two words.
6. A concern was that 神 (god) does not suitably distinguish God from a multitude of (pagan) deities (think the Chinese equivalent of the bickering Zeus family). The Catholic Church standardizes this to 天主 (“our Lord in Heaven”); in fact, the name for Catholicism in Chinese is 天主教, “the religion of 天主”.

Changelog

v1.0 (2022-09-23). First public version.
v1.01 (2022-09-24). Added treatment of God token for Catholics in the footnote; added additional use of handling corrections upstream, or passing jyutping corrections upstream.
v1.10 (2022-09-25). Added graphics to illustrate typesetting flexibility. Added “path forward” and Typst.
v1.12 (2022-09-28). Added demo PDF.

Responses

awongdev

September 24, 2023 2:31 am

This is amazing work and an incredible write-up. Thank you for doing it.

1. Jon Chui
  
  September 24, 2023 9:41 am
  
  Thank you. I think there are different pieces that would be useful to different people, so it’s worth putting some time in this. I’ve *almost* worked out how to make interactive bits (e.g., instead of listing the size combinations, let the reader play and discover), and that’ll probably come along in the Compendium to Cantonese Romanization Systems article (sometime next month?).
