On Dec 30, 2007, at 4:09 AM, Andreas Raab wrote:
Yoshiki Ohshima wrote:
Hm ... lemme try this ... ah, interesting. It appears that I can make the Umlauts work on Unix correctly if and only if:
- I fix the above method to return UTF8TextConverter in every case
[*1]
- I use -pathenc MacRoman -textenc MacRoman
Which makes no sense to me since neither the path nor the text encoding is MacRoman but it appears to work. Huh?
Yes, on the Unix VM, another historical mishap caused it: "MacRoman" still means "no conversion", so if the image passes a UTF-8 string, that UTF-8 string is passed to system calls unchanged.
Playing around a little it appears as if the Unix VM always converts path names with the assumption that Squeak uses MacRoman in the image and only -pathenc affects the translation between file system and the image (i.e., -textenc has *no* effect on path name translation whatsoever). Can someone confirm this? It would explain why -pathenc MacRoman works (since like you say it's really the "no conversion" flag) if combined with a proper file name converter in the image.
Mmm, for -pathenc and -textenc, from what I can see, data coming from the file system is assumed to be in the -pathenc encoding, is translated to a CFString in UTF-32, and is then translated back to a byte string in -textenc.
When sending data to the file system, it is assumed to be in the -textenc encoding, is translated to a CFString in UTF-32, is normalized with CFStringNormalize(str, kCFStringNormalizationFormD); // canonical decomposition, and is then translated back to a byte string in -pathenc.
I'll note that the kCFStringNormalizationFormD operation (and everything above and below) only occurs on a Macintosh. On a Linux/BSD Unix system, iconv is used instead. So is this on a Mac or on some Linux/BSD system?
That, and I think a kCFStringNormalizationFormC is needed in the first step (file system to image) to properly compose the characters.
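To make the two directions concrete, here is a minimal Python sketch, not the VM's actual C code. The function names from_file_system and to_file_system are made up for illustration, Python's unicodedata.normalize stands in for CFStringNormalize, and the NFC step on the way in is the one I am suggesting, not something the Unix VM does today.

```python
import unicodedata

def from_file_system(raw: bytes, path_enc: str, text_enc: str) -> bytes:
    """File system -> image: decode as path_enc, compose, encode as text_enc."""
    text = raw.decode(path_enc)
    text = unicodedata.normalize("NFC", text)  # proposed composition step
    return text.encode(text_enc, errors="replace")

def to_file_system(raw: bytes, text_enc: str, path_enc: str) -> bytes:
    """Image -> file system: decode as text_enc, decompose, encode as path_enc."""
    text = raw.decode(text_enc)
    text = unicodedata.normalize("NFD", text)  # canonical decomposition
    return text.encode(path_enc, errors="replace")

# A decomposed u-umlaut as a Mac file system might hand it over:
# 'u' followed by COMBINING DIAERESIS (U+0308, 0xCC 0x88 in UTF-8).
on_disk = b"u\xcc\x88.txt"
in_image = from_file_system(on_disk, "utf-8", "utf-8")
print(in_image)                                      # precomposed bytes
print(to_file_system(in_image, "utf-8", "utf-8") == on_disk)
```

With composition on the way in and decomposition on the way out, the image only ever sees precomposed names while the file system keeps its decomposed form.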
For background
OK, let's see: in the Mac Carbon VM, for a file named ƏÁ.png (LATIN CAPITAL LETTER SCHWA + LATIN CAPITAL LETTER A WITH ACUTE), we get back from the file system
0xC6, 0x8F, A, 0xCC, 0x81, .png. Note how the A followed by 0xCC 0x81 is the decomposed UTF-8 form; this is what is stored in the HFS+ file system.
We convert that from UTF-8 to MacRoman, which is the default target for path names in the base Carbon VM. This means creating a CFString from the bytes as kCFStringEncodingUTF8, applying CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined, and then pulling the bytes back as MacRoman, which gives
?, 0xE7, .png
since the translation of the Ə from UTF-8 to MacRoman fails, but the 0xE7 is correct MacRoman for the Á.
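The same result can be reproduced outside the VM. This is only a sketch: Python's "mac_roman" codec and unicodedata.normalize stand in for the CFString calls, and the helper name to_mac_roman is made up for illustration.

```python
import unicodedata

def to_mac_roman(utf8_bytes: bytes, compose: bool = True) -> bytes:
    """Decode UTF-8, optionally compose (NFC), then encode as MacRoman.

    Characters with no MacRoman equivalent become '?'.
    """
    text = utf8_bytes.decode("utf-8")
    if compose:
        text = unicodedata.normalize("NFC", text)
    return text.encode("mac_roman", errors="replace")

on_disk = b"\xc6\x8fA\xcc\x81.png"   # decomposed UTF-8 for the schwa + A-acute name

print(to_mac_roman(on_disk))                  # b'?\xe7.png'
print(to_mac_roman(on_disk, compose=False))   # b'?A?.png'
```

Without the composition step the combining accent has no MacRoman equivalent either, so it also degrades to a question mark.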
Now if I set the VM up to use UTF-8 as the path name default, then after we apply the kCFStringNormalizationFormC step and pull back the data as UTF-8, it is
0xC6, 0x8F, 0xC3, 0x81, .png where the 0xC3 0x81 is the Á (LATIN CAPITAL LETTER A WITH ACUTE) in UTF-8, or 0x00C1 in UTF-16, or 0x000000C1 in UTF-32.
Now if I remove the CFStringNormalize(str, kCFStringNormalizationFormC); // pre-combined step, which does not exist in the base Unix VM, then I get back
0xC6, 0x8F, A, 0xCC, 0x81, '.png'
which is the decomposed UTF8. I'll note in the file browser it shows as Æ∑AÌ∞.png
but it does work....
NOW the question is what does the translation do... mmm. Well, if I try
LanguageEnvironment classPool at: #FileNameConverterClass put: UTF8TextConverter
then it shows: ?A?.png
which is mmm, less wrong? But it does work.
Now if, in the base Unix VM, you set both the path encoding and the text encoding to MacRoman, then in Convert(ux,sq, Path, uxPathEncoding, sqTextEncoding, 0, 0); we are saying that the operating system (Unix) path encoding is MacRoman and the Squeak path name encoding is MacRoman. That gives back
0xC6, 0x8F, A, 0xCC, 0x81, '.png'
since as thought the macroman to macroman translation does nothing.
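A quick way to convince yourself of that, sketched in Python. The convert helper is hypothetical and decodes and re-encodes explicitly; the real VM may simply skip the conversion when both encodings match, I have not checked which.

```python
def convert(raw: bytes, src_enc: str, dst_enc: str) -> bytes:
    """Reinterpret raw bytes from src_enc into dst_enc (no normalization)."""
    return raw.decode(src_enc).encode(dst_enc)

utf8_name = b"\xc6\x8fA\xcc\x81.png"   # decomposed UTF-8 from the file system

# Every byte is a valid MacRoman character, so MacRoman -> MacRoman
# hands the UTF-8 bytes back untouched.
passed_through = convert(utf8_name, "mac_roman", "mac_roman")
print(passed_through == utf8_name)
```

So "-pathenc MacRoman -textenc MacRoman" really amounts to passing the raw UTF-8 through, decomposed sequences included.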
However, the UTF8TextConverter does not work with decomposed UTF-8, so what the user sees is not correct: it assumes we are working with precomposed UTF-8 when it converts it to UTF-32 for the font system's enjoyment.
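To show what a plain UTF-8 decoder hands to the font system in each case, here is a small Python sketch. It is not the UTF8TextConverter code, and the helper name code_points is made up for illustration.

```python
import unicodedata

def code_points(utf8_bytes: bytes, compose: bool = False) -> list:
    """Decode UTF-8 and list the resulting code points, optionally after NFC."""
    text = utf8_bytes.decode("utf-8")
    if compose:
        text = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in text]

decomposed = b"\xc6\x8fA\xcc\x81"

print(code_points(decomposed))                # ['U+018F', 'U+0041', 'U+0301']
print(code_points(decomposed, compose=True))  # ['U+018F', 'U+00C1']
```

A converter that maps one code point to one glyph gets a bare A plus a separate combining accent in the first case, and the single Á it expects only in the second.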
I suspect that to fix this properly, a CFStringNormalize(str, kCFStringNormalizationFormC); is needed in the Unix sqUnixCharConv.c, applied where applicable.
Notes from Apple's site
For example, an Á (A acute) can be encoded either precomposed, as U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE), or decomposed, as U+0041 U+0301 (LATIN CAPITAL LETTER A followed by a COMBINING ACUTE ACCENT; the U+0301 is 0xCC 0x81 in UTF-8). Precomposed characters are more common in the Windows world, whereas decomposed characters are more common on the Mac.
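The two forms can be inspected with Python's unicodedata module. This is just an illustration; the describe helper is made up.

```python
import unicodedata

def describe(text: str) -> list:
    """Return (code point, Unicode name, UTF-8 bytes as hex) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())
        for ch in text
    ]

precomposed = "\u00c1"                                  # A acute as one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # A + combining accent

print(describe(precomposed))
# [('U+00C1', 'LATIN CAPITAL LETTER A WITH ACUTE', 'c381')]
print(describe(decomposed))
# [('U+0041', 'LATIN CAPITAL LETTER A', '41'), ('U+0301', 'COMBINING ACUTE ACCENT', 'cc81')]
```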
--
John M. McIntosh johnmci@smalltalkconsulting.com
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com