[Vm-dev] Re: Unix VM path encodings
John M McIntosh
johnmci at smalltalkconsulting.com
Sun Dec 30 22:30:59 UTC 2007
On Dec 30, 2007, at 4:09 AM, Andreas Raab wrote:
> Yoshiki Ohshima wrote:
>>> Hm ... lemme try this ... ah, interesting. It appears that I can
>>> make the Umlauts work on Unix correctly if and only if:
>>> * I fix the above method to return UTF8TextConverter in every case
>>> * I use -pathenc MacRoman -textenc MacRoman
>>> Which makes no sense to me since neither the path nor the text
>>> encoding is MacRoman but it appears to work. Huh?
>> Yes, on Unix VM, another historical mishappen caused it; "MacRoman"
>> still means "no conversion" so that if the image passes UTF-8 string,
>> the UTF-8 string is passed to system calls.
> Playing around a little it appears as if the Unix VM always converts
> path names with the assumption that Squeak uses MacRoman in the
> image and only -pathenc affects the translation between file system
> and the image (i.e., -textenc has *no* effect on path name
> translation whatsoever). Can someone confirm this? It would explain
> why -pathenc MacRoman works (since like you say it's really the "no
> conversion" flag) if combined with a proper file name converter in
> the image.
Mmm, for -pathenc and -textenc, from what I can see the data coming
from the file system is assumed to be encoded as -pathenc; it is
translated to a CFString in UTF-32, then translated back to a byte
string in -textenc.
In sending data to the file system, it is assumed to be encoded as
-textenc; it is translated to a CFString in UTF-32, then
CFStringNormalize(str, kCFStringNormalizationFormD); // canonical
decomposition is applied, and the result is translated back to
a byte string in -pathenc.
I'll note the kCFStringNormalizationFormD operation (and everything
above and below) only occurs if this is a Macintosh. If this is a
Linux/BSD Unix system then iconv is used. So is this on a Mac or some
Linux/BSD?
That, and I think a kCFStringNormalizationFormC is needed in the first
step to properly compose the characters.
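
A rough sketch of the two directions in Python, for illustration only
(this is not the VM's C code). The function names fs_to_image and
image_to_fs are hypothetical, unicodedata.normalize stands in for
CFStringNormalize, and the Form C step on the way in is the proposed
addition, not something the existing code is known to do.

```python
import unicodedata

def fs_to_image(raw: bytes, path_enc: str, text_enc: str) -> bytes:
    """File system -> image: decode as path_enc, compose, encode as text_enc."""
    s = raw.decode(path_enc)
    # Proposed extra step (assumption): compose characters, like
    # kCFStringNormalizationFormC.
    s = unicodedata.normalize("NFC", s)
    return s.encode(text_enc, errors="replace")

def image_to_fs(raw: bytes, text_enc: str, path_enc: str) -> bytes:
    """Image -> file system: decode as text_enc, decompose, encode as path_enc."""
    s = raw.decode(text_enc)
    # Canonical decomposition, like kCFStringNormalizationFormD.
    s = unicodedata.normalize("NFD", s)
    return s.encode(path_enc, errors="replace")

# Precomposed A-acute going out to the file system becomes A + combining accent.
out_bytes = image_to_fs(b"\xc3\x81.png", "utf-8", "utf-8")
# Decomposed bytes coming back are composed again.
in_bytes = fs_to_image(out_bytes, "utf-8", "utf-8")
```

With both encodings set to UTF-8, the only visible effect is the
normalization: the name is decomposed on the way out and composed
again on the way in.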
Ok, let's see: in the Mac Carbon VM we get back from the file system for
LATIN CAPITAL LETTER SCHWA + LATIN CAPITAL LETTER A WITH ACUTE
0xC6, 0x8F, A, 0xCC, 0x81, .png
Note how the A, 0xCC, 0x81 is the decomposed UTF8; this is what is
stored in the HFS+ file system.
We convert that from UTF8 to the target of MacRoman for path names by
default in the base carbon VM.
This means converting to a CFString from kCFStringEncodingUTF8
CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined
then pulling back the bytes as MacRoman, that becomes
?, 0xE7, .png
since the translation of the Ə from UTF8 to MacRoman fails (MacRoman
has no such character), but 0xE7 is the correct MacRoman for the Á.
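
The same conversion can be reproduced in Python as a cross-check. This
is a sketch that assumes Python's mac_roman codec and NFC
normalization behave like the CFString calls for these two characters;
it does not call the VM code.

```python
import unicodedata

# Bytes as returned by the file system: SCHWA, then A + COMBINING ACUTE ACCENT.
raw = bytes([0xC6, 0x8F, 0x41, 0xCC, 0x81]) + b".png"

# Compose first (Form C), then pull the bytes back as MacRoman.
composed = unicodedata.normalize("NFC", raw.decode("utf-8"))
mac_bytes = composed.encode("mac_roman", errors="replace")

# For comparison: skipping the composition step loses the accent as well.
mac_bytes_no_nfc = raw.decode("utf-8").encode("mac_roman", errors="replace")

print(mac_bytes)         # b'?\xe7.png'
print(mac_bytes_no_nfc)  # b'?A?.png'
```

Without the composition step the combining accent has no MacRoman
equivalent either, so it is replaced as well and only the bare A
survives.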
Now if I set the VM up to use UTF8 as the path name default, then
after we apply the kCFStringNormalizationFormC step and pull back the
data as UTF8 it is
0xC6, 0x8F, 0xC3, 0x81, .png where the 0xC381 is the (LATIN CAPITAL
LETTER A WITH ACUTE) in UTF8 or 0x00c1 in utf-16 or 0x000000c1 in utf-32
Now if I remove the CFStringNormalize(str,
kCFStringNormalizationFormC); // pre-combined
which does not exist in the base unix vm, then I get back
0xC6, 0x8F, A, 0xCC, 0x81, '.png'
which is the decomposed UTF8. I'll note in the file browser it shows
but it does work....
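
For reference, the two UTF8 byte sequences can be compared side by
side in Python. This is only a sketch: unicodedata.normalize is
assumed to match kCFStringNormalizationFormC for these characters.

```python
import unicodedata

# Decomposed bytes as stored by the file system.
raw = bytes([0xC6, 0x8F, 0x41, 0xCC, 0x81]) + b".png"
text = raw.decode("utf-8")

# With the Form C step: A + combining accent collapses to one character.
with_nfc = unicodedata.normalize("NFC", text).encode("utf-8")
# Without it: the bytes pass through unchanged.
without_nfc = text.encode("utf-8")

print(with_nfc.hex(" "))     # c6 8f c3 81 2e 70 6e 67
print(without_nfc.hex(" "))  # c6 8f 41 cc 81 2e 70 6e 67
```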
NOW the question is what does the translation do... mmm Well if I try
LanguageEnvironment classPool at: #FileNameConverterClass put:
then it shows:
which is mmm, less wrong? But it does work.
Now if in the base Unix VM you set both the path encoding and the text
encoding to MacRoman
Convert(ux,sq, Path, uxPathEncoding, sqTextEncoding, 0, 0);
then we are saying the operating system (Unix) path encoding is
MacRoman and the Squeak path name encoding is MacRoman. That gives back
0xC6, 0x8F, A, 0xCC, 0x81, '.png'
since, as suspected, the MacRoman to MacRoman translation does nothing.
However the UTF8TextConverter does not work with decomposed UTF8 so
what the user would see is not correct, since it assumes we are
working with precomposed UTF8 when it converts it to UTF-32 for the
font system's enjoyment.
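
A small Python sketch of why this goes wrong. It assumes the image
side does a plain UTF8 decode with no composition, which is how the
behaviour looks from here and not something checked against the
UTF8TextConverter source.

```python
import unicodedata

decomposed = b"\xc6\x8fA\xcc\x81.png"

# MacRoman -> MacRoman is a byte-for-byte pass-through.
passthrough = decomposed.decode("mac_roman").encode("mac_roman")

# A plain UTF8 decode yields A and the combining accent as two code points.
plain = [hex(ord(c)) for c in decomposed.decode("utf-8")[:3]]

# Composing first yields the single code point the font system expects.
nfc_text = unicodedata.normalize("NFC", decomposed.decode("utf-8"))
fixed = [hex(ord(c)) for c in nfc_text[:2]]

print(plain)  # ['0x18f', '0x41', '0x301']
print(fixed)  # ['0x18f', '0xc1']
```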
I suspect to fix properly a
is needed in the Unix sqUnixCharConv.c and applied when applicable.
Notes from Apple's site
For example, an Á (A acute) can be encoded either precomposed, as
U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE), or decomposed, as U+0041
U+0301 (LATIN CAPITAL LETTER A followed by a COMBINING ACUTE ACCENT;
the accent in UTF8 is 0xCC81). Precomposed characters are more common
in the Windows world, whereas decomposed characters are more common on
the Mac.
John M. McIntosh <johnmci at smalltalkconsulting.com>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com