Unix VM path encodings
John M McIntosh
johnmci at smalltalkconsulting.com
Sun Dec 30 08:48:55 UTC 2007
For Sophie we spent hours (days?) getting this correct. However, we
never really checked out the pure Unix variation.
Actually, let's try some other UTF-32 character instead of ü.
Oh, let's say LATIN CAPITAL LETTER SCHWA -> UTF-8 0xC68F, UTF-32
0x0000018F
Ə
In the OS X 10.5.1 Finder we see
Ə.png
and in a terminal session we see
-rw-r--r-- 1 johnmci staff 26451 Apr 10 2007 Ə.png
In both cases (just in case you can't see this in the email) the
character is visually correct.
Using Squeak 3.10Alpha 7092 with a Mac Carbon VM 3.8.18b1 set to UTF-8,
what we see in the Morphic file list is
?.png
The ? is 0x3F.
Of course it says it can't open the file, because the Smalltalk code
(which code is an exercise for the reader) has mangled the 0xC68F into
0x3F. When I asked about this a few years back, I think I was told it
converts the VM data to Latin-1. The conversion from MacRoman to
Latin-1 and back is *usually* workable, mind you, but only if the
characters are <= 0xFF.
However, UTF-8 to Latin-1 usually ends up broken, which is why the Mac
Carbon VM is set to MacRoman by default.
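To make the mangling concrete, here is a small Python sketch (Python is
used only for illustration; this is not the Squeak code path). It checks
the byte values quoted above and shows how a character above 0xFF
collapses to 0x3F on the way to Latin-1. The errors="replace" policy is
an assumption standing in for whatever the image actually does with
characters it cannot represent.

```python
# Illustration only: how U+018F ends up as '?' when forced into Latin-1.
schwa = "\u018f"  # LATIN CAPITAL LETTER SCHWA

utf8_bytes = schwa.encode("utf-8")       # bytes as stored in a UTF-8 path
utf32_bytes = schwa.encode("utf-32-be")  # big-endian, no byte order mark

# Latin-1 stops at 0xFF, so the schwa cannot be represented.
# errors="replace" (an assumed policy) substitutes '?', which is 0x3F.
mangled = schwa.encode("latin-1", errors="replace")

# A character present in both MacRoman and Latin-1, such as ü,
# survives the MacRoman -> Latin-1 -> MacRoman round trip.
u_umlaut = b"\x9f".decode("mac_roman")   # 0x9F is ü in MacRoman
round_trip = u_umlaut.encode("latin-1").decode("latin-1").encode("mac_roman")

print(utf8_bytes.hex(), utf32_bytes.hex(), mangled, round_trip.hex())
```

Once the name has become ?.png the original bytes are gone, so no later
conversion can find the file again.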
Recall that in OS X, HFS Plus converts all file names to decomposed
Unicode, while Macintosh keyboards generally produce precomposed
Unicode. The Macintosh Carbon VM converts back and forth between
precomposed and decomposed Unicode when it is using UTF-8 encoding.
This also depends on the file system and how it wants to store
Unicode characters...
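The precomposed versus decomposed difference can be sketched in Python
with the standard unicodedata module. This is an approximation: HFS Plus
uses its own variant of decomposition, so plain NFD is an assumption
here, and the name "Jürgen" is borrowed from the quoted message below.

```python
import unicodedata

# Illustration only: NFD/NFC stand in for the file system's own rules.
precomposed = "J\u00fcrgen"                             # ü as one code point
decomposed = unicodedata.normalize("NFD", precomposed)  # u followed by U+0308

# Same visible text, different code points and different UTF-8 bytes.
print(len(precomposed), precomposed.encode("utf-8").hex())
print(len(decomposed), decomposed.encode("utf-8").hex())

# Converting back, as a VM must do before handing names to the image.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == precomposed)
```

A byte-for-byte comparison of the two forms fails even though they
display identically, which is why the conversion has to happen somewhere
between the file system and the image.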
Now, when we import this file into Sophie, the URI that is
generated is
/Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png
that becomes
'/Users/johnmci/Work In Progress/squeak Bugs/Æ∑.png'
But the VM ensures the proper thing is done. In Sophie we store all
media paths as encoded URI objects, and convert to
what is required when we need to access the media.
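As a rough illustration of why keeping the path percent-encoded is safe,
here is a Python sketch using the standard urllib.parse module (this is
not Sophie's URI class). The encoded form carries the raw UTF-8 bytes,
and the result depends entirely on which encoding is applied when those
bytes are turned back into characters.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# The encoded path from the example above.
uri_path = "/Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png"

# The percent escapes are just bytes; nothing is lost at this stage.
raw_bytes = unquote_to_bytes(uri_path)

# Interpreting those bytes as UTF-8 recovers the schwa.
as_utf8 = unquote(uri_path, encoding="utf-8")

# Interpreting the same bytes as Latin-1 yields a different, wrong name.
as_latin1 = raw_bytes.decode("latin-1")

print(as_utf8)
print(as_utf8 == as_latin1)
print(quote(as_utf8) == uri_path)
```

Because the encoded string is pure ASCII, it passes through any
intermediate encoding unchanged, and the choice of decoding is deferred
until the file is actually accessed.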
Oh, and btw, if you enter
file:///Users/johnmci/Work%20In%20Progress/squeak%20Bugs/%C6%8F.png
into Firefox, it's happy too. Oddly, when you enter it into Safari it
becomes file:///Users/johnmci/Work%20In%20Progress/squeak%20Bugs/Ə.png
Oh, and if I copy Ə.png from a terminal session or the Finder
window
and paste it into a Sophie text field, yes, it's Ə.png,
because the extended clipboard support converts it properly from UTF-8.
Mind you, from TextEdit it comes across as RTF, which is a different
issue, but it is *still* correctly converted into UTF-32 in Sophie.
People are of course welcome to uncover Unicode character issues with
Sophie and how it deals with file names or
textual data in text fields.
On Dec 29, 2007, at 11:32 PM, Andreas Raab wrote:
> Hi -
>
> Due to a bug reported against Qwaq Forums I needed to look into how
> the Unix VM encodes file and path names and got terribly confused.
> My test case was to create a file with an umlaut ("Jürgen") and to
> see what both Squeak and the Unix shell report with varying
> settings of -pathenc and -textenc.
>
> I started with the assumption that since the file system I was
> running this on is UTF-8 the default settings (-textenc MacRoman and
> -pathenc UTF-8) ought to be correct. However, the result was very
> surprising. The file name was reported incorrectly both in the file
> list as well as by the OS - the file list reported "J?" (truncated
> after the question mark) and the Unix shell reported "J?rgen" but
> with a "funky ?" (the glyph is hard to describe without a
> screenshot; it was neither an umlaut nor a regular question mark).
>
> Playing with the settings I could not find any combination that
> resulted in a consistent representation for all the different views
> - either the Unix shell was off or Squeak's view was off no matter
> how I set those encodings. Can someone explain to me how I need to
> set these values to get a consistent view on file names both from
> Squeak and Unix?
>
> Cheers,
> - Andreas
>
--
========================================================================
John M. McIntosh <johnmci at smalltalkconsulting.com>
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
========================================================================