[squeak-dev] Proposal | Stop writing ]lang[ leadingChar-runs in chunks (i.e., file-outs)

Marcel Taeumel marcel.taeumel at hpi.de
Sat Jan 29 16:55:08 UTC 2022

Ha! ]lang[ was the only way to mix different languages in a single chunk of source code before Unicode appeared (or before UTF-8 was used for file-out). The traditional file-out encoding was MacRoman plus the #leadingChar info via ]lang[ (similar to ]style[ ) so that a file-in could directly attach those bits (and thus the encoding) to the respective characters again. Hmm... maybe ... no. Decoding would treat only single bytes ... so there would be no multi-byte encoding in the end. Hmm... I still think it is about mixing different encodings in a single chunk of serialized code. But why? Maybe it is not for mixing encodings within a chunk but for supporting many chunks with different encodings? Hmm... but looking at the .changes or .sources file, you could only combine 1-byte encodings ... I have no clue ... why file out ]lang[ at all?

Now I also wonder ... looking at all those TextConverter classes ... converting to "squeak" seems to mean converting to "Unicode + leadingChar" these days. Was this always the case?

On 29.01.2022 16:16:39, Marcel Taeumel <marcel.taeumel at hpi.de> wrote:
Hi all --

We all know that the path to a clean multilingual Squeak based on Unicode is still bumpy. The concept of #leadingChar was an interesting trade-off to efficiently extend pre-rendered StrikeFonts with selected Unicode glyphs. See StrikeFontSet and #DefaultMultiStyle. For example, #installFonts for a JapaneseEnvironment is still functional: http://metatoys.org/pub/FontJapaneseEnvironment.sar

Anyway. #fallbackFont in StrikeFont (and TTCFont) can do (almost) the same trick without having to modify the Unicode code points with a #leadingChar. That is, the original intent to circumvent Unicode's Han unification might still be preserved by configuring different fallback fonts and/or text styles per use case (e.g., tool, text field, ...).

As a first step, I propose to stop filing out those leadingChar-runs as ]lang[ in the chunk format. Even with the current implementation of leadingChar, it makes no sense to preserve them outside the .image. For example, reading code from a .changes, .sources, .st, or .cs file might preserve the leadingChars; however, working with the respective code fragments (e.g., cut/copy/paste) will immediately remove that "mask on characters" anyway if the system has a different leadingChar (or 0 for Latin1/Unicode). I cannot quite grasp the original intention of preserving those in chunk format. Maybe it was performance.
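To make the "mask on characters" idea concrete, here is a rough sketch in Python (not Squeak code). It assumes that a character value keeps the Unicode code point in its low 22 bits and the leadingChar in the 8 bits above; that layout, the function names, and the tag value 5 are all assumptions for illustration and should be checked against Character in the image.

```python
# Sketch only: models a leadingChar as tag bits stored above the code point.
# ASSUMPTION: low 22 bits = Unicode code point, next 8 bits = leadingChar.
# All names here are hypothetical; none of them are Squeak selectors.

CODE_POINT_BITS = 22
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1
LEADING_CHAR_MAX = 0xFF


def with_leading_char(leading_char: int, code_point: int) -> int:
    """Combine a language tag and a code point into one character value."""
    if not 0 <= leading_char <= LEADING_CHAR_MAX:
        raise ValueError("leading_char out of range")
    if not 0 <= code_point <= CODE_POINT_MASK:
        raise ValueError("code_point out of range")
    return (leading_char << CODE_POINT_BITS) | code_point


def leading_char_of(value: int) -> int:
    """Extract the language tag from a character value."""
    return (value >> CODE_POINT_BITS) & LEADING_CHAR_MAX


def code_point_of(value: int) -> int:
    """Extract the plain Unicode code point from a character value."""
    return value & CODE_POINT_MASK


def strip_leading_chars(values: list[int]) -> list[int]:
    """Drop all language tags, which is roughly what happens to pasted text
    on a system whose leadingChar is 0."""
    return [code_point_of(v) for v in values]


if __name__ == "__main__":
    # U+76F4 is a Han-unified ideograph; 5 is a made-up tag value here.
    tagged = with_leading_char(5, 0x76F4)
    print(leading_char_of(tagged), hex(code_point_of(tagged)))
    print([hex(v) for v in strip_leading_chars([tagged, ord("A")])])
```

Under these assumptions, a leadingChar of 0 leaves the value equal to the plain code point, which is why stripping the tag loses nothing except the font/language hint.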

So ... any more thoughts on this? Maybe a plausible explanation for why ]lang[ exists in the first place, given that all converters sending #leadingChar:code: seem to attach that on-the-fly anyway?
