[Vm-dev] Re: Unix VM path encodings

John M McIntosh johnmci at smalltalkconsulting.com
Sun Dec 30 08:48:55 UTC 2007

For Sophie we spent hours (days?) getting this correct. However, we
never really checked the pure Unix variation.

Actually, let's try some other utf-32 character instead of ü.

Oh let's say LATIN CAPITAL LETTER SCHWA   -> UTF-8  0xC68F UTF-32  
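
As a quick cross-check of those byte values, here is a minimal sketch
in Python (used purely for illustration; it is not what the VM or the
image does internally). It encodes U+018F, LATIN CAPITAL LETTER SCHWA,
and prints the resulting bytes in hex.

```python
import unicodedata

schwa = "\u018f"  # LATIN CAPITAL LETTER SCHWA

# Confirm this really is the character named in the text.
name = unicodedata.name(schwa)

utf8_bytes = schwa.encode("utf-8")       # two bytes
utf32_bytes = schwa.encode("utf-32-be")  # four bytes, big-endian, no BOM

print(name)
print(utf8_bytes.hex().upper())   # C68F
print(utf32_bytes.hex().upper())  # 0000018F
```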


in the os-x 10.5.1 Finder we see


and in a terminal session we see

-rw-r--r--   1 johnmci  staff  26451 Apr 10  2007 Ə.png

In both cases the character is visually correct, just in case you
can't see this in the email.

Using Squeak 3.10Alpha 7092 with a Mac Carbon VM 3.8.18b1 set to utf8,
when we use the Morphic file list, what we see is a ? in place of the
character; the ? is 0x3F.

Of course it says it can't open the file, because the Smalltalk code
(which code is an exercise for the reader) has mangled the 0xC68F into
0x3F. When I asked about this a few years back, I think I was told it
converts the VM data to latin1. The conversion from macroman to latin1
and back is *usually* workable, but mind, only if the characters
are <= 0xFF.

However, utf8 to latin1 usually ends up broken, which is why the Mac
Carbon VM is set to macroman by default.
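
To make the difference concrete, here is a rough sketch in Python (for
illustration only; the actual conversion code in the image may well
differ). It assumes the lossy step behaves like "replace anything
latin1 cannot represent with ?", which matches the 0x3F seen above but
is an assumption about the mechanism.

```python
schwa = "\u018f"     # LATIN CAPITAL LETTER SCHWA, not representable in latin1
u_umlaut = "\u00fc"  # ü, present in both macroman and latin1

# Assumption: unencodable characters are replaced with "?" (0x3F).
mangled = schwa.encode("latin-1", errors="replace")

# macroman -> latin1 -> macroman survives for a character both sets share.
mac_byte = u_umlaut.encode("mac_roman")                         # 0x9F
latin1_byte = mac_byte.decode("mac_roman").encode("latin-1")    # 0xFC
round_trip = latin1_byte.decode("latin-1").encode("mac_roman")  # 0x9F again

# utf8 bytes misread as latin1 turn one character into two wrong ones.
misread = u_umlaut.encode("utf-8").decode("latin-1")

print(mangled, mac_byte, latin1_byte, round_trip, repr(misread))
```

The round trip works here only because ü exists in both character
sets; a macroman character with no latin1 equivalent would not survive.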

Recall that in OS X, HFS Plus converts all file names to decomposed
Unicode, while Macintosh keyboards generally produce precomposed
Unicode. The Macintosh Carbon VM converts back and forth between
precomposed and decomposed Unicode when it is using UTF8 encoding.
This also depends on the file system and how it wants to store
Unicode characters...
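
The precomposed versus decomposed distinction can be sketched in Python
as follows. This is only an illustration: standard NFD is used as a
stand-in for what HFS Plus does, which is close to but not exactly NFD,
and the name "Jürgen" is borrowed from the test case in the quoted
message below.

```python
import unicodedata

precomposed = "J\u00fcrgen"  # ü as the single code point U+00FC
decomposed = unicodedata.normalize("NFD", precomposed)  # u followed by U+0308

nfc_bytes = precomposed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")

# Same visible name, different byte sequences on disk.
print(nfc_bytes.hex().upper())  # 4AC3BC7267656E
print(nfd_bytes.hex().upper())  # 4A75CC887267656E

# Normalizing back to NFC recovers the precomposed form.
recomposed = unicodedata.normalize("NFC", decomposed)
```

A byte-wise comparison of the two names fails even though they look
identical, which is why something has to convert in both directions.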

Now, when we import this into Sophie, the URI that was
generated is


that becomes

'/Users/johnmci/Work In Progress/squeak Bugs/Æ∑.png'

But the VM ensures the proper thing is done. In Sophie we store all
media paths as encoded URI objects, and convert them to
whatever is required when we need to access the media.

Oh and btw if you enter
into Firefox, it's happy too. Oddly, when you enter it into Safari it
becomes file:///Users/johnmci/Work%20In%20Progress/squeak%20Bugs/Ə.png
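
For anyone who wants to see what an encoded form of such a path looks
like, here is a small Python sketch. It percent-encodes the UTF-8 bytes
of the path shown above; this illustrates the general technique and is
not necessarily the exact URI Sophie generated.

```python
from urllib.parse import quote, unquote

path = "/Users/johnmci/Work In Progress/squeak Bugs/\u018f.png"

# quote() encodes to UTF-8 and percent-escapes; "/" is left alone by default.
encoded = quote(path)
uri = "file://" + encoded

# Decoding restores the original path, schwa included.
decoded = unquote(encoded)

print(uri)
```

Here the schwa becomes %C6%8F, the same two UTF-8 bytes discussed
earlier, and the spaces become %20.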

Oh, and if I take the Ə.png from a terminal session or the Finder
and paste it into a Sophie text field, yes, it's Ə.png,
because the extended clipboard support converts it properly from utf8.
Mind, from TextEdit it comes across as RTF, which is a different issue,
but it *still* is correctly converted into utf-32 in Sophie.

People are of course welcome to uncover Unicode character issues with
Sophie and how it deals with file names or
textual data in text fields.

On Dec 29, 2007, at 11:32 PM, Andreas Raab wrote:

> Hi -
> Due to a bug reported against Qwaq Forums I needed to look into how  
> the Unix VM encodes file and path names and got terribly confused.  
> My test case was to create a file with an Umlaut("Jürgen") and to  
> see what both Squeak and the Unix shell reports with varying  
> settings of -pathenc and -textenc.
> I started with the assumption that since the file system I was  
> running this on is UTF-8 the default settings (-textenc MacRoman and  
> -pathenc UTF-8) ought to be correct. However, the result was very  
> surprising. The file name was reported incorrectly both in the file  
> list as well as by the OS - the file list reported "J?" (truncated  
> after the question mark) and the Unix shell reported "J?rgen" but  
> with a "funky ?" (the glyph is hard to describe without a  
> screenshot; it was neither an umlaut nor a regular question mark).
> Playing with the settings I could not find any combination that  
> resulted in a consistent representation for all the different views  
> - either the Unix shell was off or Squeak's view was off no matter  
> how I set those encodings. Can someone explain to me how I need to  
> set these values to get a consistent view on file names both from  
> Squeak and Unix?
> Cheers,
>  - Andreas

John M. McIntosh <johnmci at smalltalkconsulting.com>
Corporate Smalltalk Consulting Ltd.  http://www.smalltalkconsulting.com
