[Vm-dev] Re: Unix VM path encodings

John M McIntosh johnmci at smalltalkconsulting.com
Sun Dec 30 22:30:59 UTC 2007

On Dec 30, 2007, at 4:09 AM, Andreas Raab wrote:

> Yoshiki Ohshima wrote:
>>> Hm ... lemme try this ... ah, interesting. It appears that I can  
>>> make the Umlauts work on Unix correctly if and only if:
>>> * I fix the above method to return UTF8TextConverter in every case  
>>> [*1]
>>> * I use -pathenc MacRoman -textenc MacRoman
>>> Which makes no sense to me since neither the path nor the text  
>>> encoding is MacRoman but it appears to work. Huh?
>>  Yes, on Unix VM, another historical mishappen caused it; "MacRoman"
>> still means "no conversion" so that if the image passes UTF-8 string,
>> the UTF-8 string is passed to system calls.
> Playing around a little it appears as if the Unix VM always converts  
> path names with the assumption that Squeak uses MacRoman in the  
> image and only -pathenc affects the translation between file system  
> and the image (i.e., -textenc has *no* effect on path name  
> translation whatsoever). Can someone confirm this? It would explain  
> why -pathenc MacRoman works (since like you say it's really the "no  
> conversion" flag) if combined with a proper file name converter in  
> the image.

Hmm. For -pathenc and -textenc, from what I can see, data coming from
the file system is declared to be in the -pathenc encoding, converted
to a CFString in UTF-32, and then converted back to a byte string in
-textenc.

When sending data to the file system, it is declared to be in the
-textenc encoding and converted to a CFString in UTF-32, then
    CFStringNormalize(str, kCFStringNormalizationFormD); // canonical decomposition
is applied, and the result is converted back to a byte string in
-pathenc.
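
The two directions described above can be sketched roughly in Python. This is only an illustration under assumptions, not the VM's code: `fs_to_image` and `image_to_fs` are hypothetical names, a Python `str` stands in for the intermediate CFString, and `unicodedata.normalize` stands in for CFStringNormalize.

```python
import unicodedata

def fs_to_image(raw: bytes, path_enc: str, text_enc: str) -> bytes:
    """File system -> image: decode as -pathenc, re-encode as -textenc."""
    text = raw.decode(path_enc)  # stand-in for building the CFString
    return text.encode(text_enc, errors="replace")

def image_to_fs(raw: bytes, text_enc: str, path_enc: str) -> bytes:
    """Image -> file system: decode as -textenc, decompose, re-encode as -pathenc."""
    text = raw.decode(text_enc)
    text = unicodedata.normalize("NFD", text)  # canonical decomposition
    return text.encode(path_enc, errors="replace")

# The image holds precomposed UTF-8 for "Á.png"; the outgoing step decomposes it.
on_disk = image_to_fs(b"\xc3\x81.png", "utf-8", "utf-8")
print(on_disk)  # b'A\xcc\x81.png'
```

Note that `fs_to_image` as sketched does no composition, so a decomposed name read back from disk stays decomposed.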

I'll note the kCFStringNormalizationFormD operation (and everything
above and below) only occurs on a Macintosh. On a Linux/BSD Unix
system, iconv is used instead. So is this on a Mac or some Linux/BSD
system?

That, and I think a kCFStringNormalizationFormC is needed in the first
step to properly compose the characters.

For background

Ok, let's see in the mac carbon vm we get back from the file system  for

0xC6, 0x8F, A, 0xCC, 0x81, .png   Note how A, 0xCC, 0x81 is the
decomposed UTF-8; this is what is stored in the HFS+ file system.

By default, the base Carbon VM converts path names from UTF-8 to the
target encoding of MacRoman.
This means creating a CFString from kCFStringEncodingUTF8,
then applying
CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined
then pulling back the bytes as MacRoman, which gives

?, 0xE7, .png

since the translation of the Ə from UTF-8 to MacRoman fails, but the
0xE7 is the correct MacRoman for the Á.
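
As a rough cross-check of that result, here is a minimal Python sketch. It assumes Python's `mac_roman` codec and `unicodedata.normalize("NFC", ...)` are reasonable stand-ins for the CoreFoundation calls, and that an unmappable character is replaced by `?`.

```python
import unicodedata

# Bytes as returned by the file system: Ə, then A + combining acute, then ".png"
raw = bytes([0xC6, 0x8F, 0x41, 0xCC, 0x81]) + b".png"

# Decode as UTF-8 and compose (NFC), mirroring the FormC step
text = unicodedata.normalize("NFC", raw.decode("utf-8"))

# Pull the bytes back as MacRoman; Ə has no MacRoman code point
mac = text.encode("mac_roman", errors="replace")
print(mac)  # b'?\xe7.png'
```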

Now, if I set the VM up to use UTF-8 as the path name default, then
after we apply the kCFStringNormalizationFormC step and pull back the
data as UTF-8, it is

0xC6, 0x8F, 0xC3, 0x81, .png   where 0xC3 0x81 is the Á (LATIN CAPITAL
LETTER A WITH ACUTE) in UTF-8, or 0x00C1 in UTF-16, or 0x000000C1 in
UTF-32
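
The three encodings quoted for the precomposed character can be checked with a short Python sketch. It again assumes `unicodedata.normalize("NFC", ...)` as a stand-in for the FormC step; the big-endian codecs are chosen only so the hex output reads like the values above.

```python
import unicodedata

raw = bytes([0xC6, 0x8F, 0x41, 0xCC, 0x81]) + b".png"
composed = unicodedata.normalize("NFC", raw.decode("utf-8"))

utf8 = composed.encode("utf-8")
a_acute = composed[1]  # second character: LATIN CAPITAL LETTER A WITH ACUTE

print(utf8.hex())                         # c68fc3812e706e67
print(a_acute.encode("utf-8").hex())      # c381
print(a_acute.encode("utf-16-be").hex())  # 00c1
print(a_acute.encode("utf-32-be").hex())  # 000000c1
```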

Now if I remove the
CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined
call, which does not exist in the base Unix VM, then I get back

0xC6, 0x8F, A, 0xCC, 0x81,  '.png'

which is the decomposed UTF8.   I'll note in the file browser it shows  

but it does work....

NOW the question is what does the translation do...  mmm Well if I try
LanguageEnvironment classPool at: #FileNameConverterClass put:  
then it shows:

which is mmm, less wrong? But it does work.

Now, if in the base Unix VM you set both the path encoding and the
text encoding to MacRoman, then in
Convert(ux,sq, Path, uxPathEncoding, sqTextEncoding, 0, 0);
we are saying that the operating system (Unix) path encoding is
MacRoman and the Squeak path name encoding is MacRoman. That gives back

0xC6, 0x8F, A, 0xCC, 0x81, '.png'

since, as expected, the MacRoman to MacRoman translation does nothing.
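
A minimal Python sketch of why a same-encoding conversion leaves the bytes alone. The `convert` helper is hypothetical, and it really does decode and re-encode, whereas the VM may simply skip the conversion; the point is only that the UTF-8 bytes come out untouched.

```python
def convert(raw: bytes, from_enc: str, to_enc: str) -> bytes:
    """Hypothetical stand-in for one path conversion step."""
    return raw.decode(from_enc).encode(to_enc)

# Decomposed UTF-8 bytes, treated as if they were MacRoman on both sides
raw = bytes([0xC6, 0x8F, 0x41, 0xCC, 0x81]) + b".png"
same = convert(raw, "mac_roman", "mac_roman")
print(same == raw)  # True
```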

However the UTF8TextConverter does not work with decomposed UTF8 so  
what the user would see is not correct, since it assumes we are  
working with precomposed UTF8 when it converts it to UTF-32 for the  
font system's enjoyment.
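
To illustrate the mismatch, a small Python sketch follows. It assumes a plain UTF-8 decode approximates what a converter without normalization would hand to the font system; it does not reproduce UTF8TextConverter itself.

```python
import unicodedata

# Decomposed UTF-8 for A + COMBINING ACUTE ACCENT
decomposed = bytes([0x41, 0xCC, 0x81]).decode("utf-8")
composed = unicodedata.normalize("NFC", decomposed)

# A plain decode yields two code points; composing first yields one
print([hex(ord(c)) for c in decomposed])  # ['0x41', '0x301']
print([hex(ord(c)) for c in composed])    # ['0xc1']
```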

I suspect that to fix this properly, a
CFStringNormalize(str, kCFStringNormalizationFormC);
is needed in the Unix sqUnixCharConv.c, applied when applicable.

Notes from Apple's site

For example, an Á (A acute) can be encoded either precomposed, as
U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE), or decomposed, as U+0041
U+0301 (UTF8 is 0xCC81) (LATIN CAPITAL LETTER A followed by a
COMBINING ACUTE ACCENT). Precomposed characters are more common in the
Windows world, whereas decomposed characters are more common on the Mac.

John M. McIntosh <johnmci at smalltalkconsulting.com>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
