[squeak-dev] Proposal | Stop writing ]lang[ leadingChar-runs in chunks (i.e., file-outs)

Sat Jan 29 17:33:00 UTC 2022

> On 29. Jan 2022, at 17:55, Marcel Taeumel <marcel.taeumel at hpi.de> wrote:
> 
> Ha! ]lang[ was the only way to mix different languages in a single chunk of source code before Unicode appeared (or before UTF-8 was used for file-out). The traditional file-out encoding was MacRoman plus the #leadingChar info via ]lang[ (similar to ]style[) so that a file-in could directly attach those bits (thus the encoding) to the respective characters again. Hmm... maybe ... no. Decoding would treat only single bytes ... so no multi-byte encoding in the end. Hmm... I still think it is about mixing different encodings in a single chunk of serialized code. But why? Maybe it is not for mixing encodings within a chunk but for supporting many chunks with different encodings? Hmm... but looking at the .changes or .sources file, you could only combine 1-byte encodings ... I have no clue ... why file out ]lang[ at all?
> 
> Now I also wonder ... looking at all those TextConverter classes ... converting to "squeak" seems to mean converting to "Unicode + leadingChar" these days. Was this always the case?
> 

No. Squeak was MacRoman first, then (if I correctly recall the account of Andreas and Yoshiki) ISO 8859-1, and then Unicode.

Best regards
	-Tobias

> Best,
> Marcel
>> Am 29.01.2022 16:16:39 schrieb Marcel Taeumel <marcel.taeumel at hpi.de>:
>> 
>> Hi all --
>> 
>> We all know that the path to a clean multilingual Squeak based on Unicode is still bumpy. The concept of #leadingChar was an interesting trade-off to efficiently extend pre-rendered StrikeFonts with selected Unicode glyphs. See StrikeFontSet and #DefaultMultiStyle. For example, #installFonts for a JapaneseEnvironment is still functional: http://metatoys.org/pub/FontJapaneseEnvironment.sar
>> 
>> Anyway. #fallbackFont in StrikeFont (and TTCFont) can do (almost) the same trick without having to modify the Unicode code points with a #leadingChar. That is, the original intent to circumvent Unicode's Han unification might still be preserved by configuring different fallback fonts and/or text styles per use case (e.g., tool, text field, ...).
>> 
>> As a first step, I propose to stop filing out those leadingChar-runs as ]lang[ in the chunk format. Even with the current implementation of leadingChar, it makes no sense to preserve them outside the .image. For example, reading code from a .changes, .sources, .st, or .cs file might preserve the leadingChars; however, working with the respective code fragments (e.g., cut/copy/paste) will immediately remove that "mask on characters" anyway if the system has a different leadingChar (or 0 for Latin1/Unicode). I cannot quite grasp the original intention of preserving those in chunk format. Maybe it was performance.
>> 
>> So ... any more thoughts on this? Maybe a plausible explanation of why ]lang[ exists in the first place, given that all converters sending #leadingChar:code: seem to attach that on-the-fly anyway?
>> 
>> Best,
>> Marcel
>