Experimental 3.5-1 VM and multilingual support

Ian Piumarta ian.piumarta at inria.fr
Tue Mar 4 07:52:00 UTC 2003


Folks,

For the intrepid explorers out there, I've just updated the 3.5-1devel
tarballs available in the usual place [1] (GNU/Linux{386,ppc}, Darwin/ppc
and a MacOSX .app bundle).  I'm particularly interested to hear about any
problems with 8-bit characters, either in copy/paste between Squeak and
remote applications or with international keyboards.  Here's the bulk of
the beef...

3.5-1 can now use arbitrary character set encodings for: the internal
font encoding, the encoding used to copy/paste text from/to remote
applications, and the encoding it expects the filesystem to be using.
The VM is now also capable of supplying the clipboard text to X11
applications that request UTF8_STRING conversion.  Three new
command-line switches and three new environment variables are provided
to control the behaviour, as follows:

  -encoding <enc>  (or SQUEAK_ENCODING="<enc>")

    tells the VM which encoding is being used by the fonts within the
    image (and hence the encoding which is used for 8-bit characters
    arriving either from the keyboard or from text copied from
    elsewhere).  The default is still MacRoman.  If you are using the
    X11Fonts package, you can get the accents back in the right
    places in all the X11 fonts at once by setting
	SQUEAK_ENCODING="ISO-8859-15" (or
	SQUEAK_ENCODING="Latin9" which is the same thing)
    in your environment.  The default will change to ISO-8859-15
    when the image drops the Apple fonts.
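
To see why this setting matters, here is a small illustration in Python (not VM code; the codec names are Python's own, which I am assuming match the encodings named above).  It shows that the same 8-bit value denotes different characters under the two encodings:

```python
# Illustration only: Python codecs stand in for the VM's conversion code.
def decode_byte(value, encoding):
    """Return the character that a single 8-bit value denotes in `encoding`."""
    return bytes([value]).decode(encoding)

# 0xE9 is e-acute in ISO-8859-15 but E-grave in MacRoman.
latin9_char = decode_byte(0xE9, "iso-8859-15")
macroman_char = decode_byte(0xE9, "mac_roman")

# The same character therefore has a different byte value in each encoding.
e_acute_in_macroman = "\u00e9".encode("mac_roman")
e_acute_in_latin9 = "\u00e9".encode("latin9")

print(repr(latin9_char), repr(macroman_char))
print(e_acute_in_macroman.hex(), e_acute_in_latin9.hex())
```

A font laid out in one encoding and fed bytes in the other will put the accents in the wrong places, which is the symptom described above.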

  -pathenc <enc>  (or SQUEAK_PATHENC="<enc>")

    tells the VM what encoding the filesystem is using.  Modern
    filesystems (Darwin and RedHat8 and maybe others too) use UTF-8 to
    encode 8-bit chars in pathnames.  (Older Unixes probably either
    use Latin1 or simply barf or behave randomly according to how
    individual applications are written.)  The default is "UTF-8"
    (which is where the current Unix filesystem trend is heading).  All
    file operations WITHIN THE VM SUPPORT CODE now convert the
    pathname character encoding between SQUEAK_ENCODING and
    SQUEAK_PATHENC as appropriate.  (If you have Latin1 chars in your
    paths then just set SQUEAK_PATHENC="ISO-8859-1" and things should
    work perfectly.)  File operations in other plugins (if there are
    such beasts) will probably get things hopelessly wrong on UTF-8
    based filesystems.  If any plugin writers out there want to know
    how to fix this situation (trivially) then send me email.

    Note that the above relates only to pathnames.  What the image
    chooses to do with the _contents_ of files is not the concern
    of the VM support code...
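
The pathname conversion can be pictured with a short Python sketch.  This is not the VM's code: the helper name convert_path is hypothetical, and Python's codecs stand in for whatever conversion routine the platform provides.

```python
def convert_path(name, from_enc, to_enc):
    """Re-encode pathname bytes from one character encoding to another."""
    return name.decode(from_enc).encode(to_enc)

# "cafe" with an e-acute, as the image would hold it with the default
# MacRoman font encoding.
image_name = b"caf\x8e"

# Image -> filesystem, with SQUEAK_PATHENC left at its UTF-8 default.
fs_name = convert_path(image_name, "mac_roman", "utf-8")

# Filesystem -> image, e.g. for names coming back from a directory listing.
round_trip = convert_path(fs_name, "utf-8", "mac_roman")

# With SQUEAK_PATHENC="ISO-8859-1" the accented letter is a single Latin1 byte.
latin1_name = convert_path(image_name, "mac_roman", "iso-8859-1")

print(fs_name, round_trip, latin1_name)
```

A plugin that passes image_name straight to the filesystem skips this step, which is why it would get things wrong on a UTF-8 based filesystem.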

  -textenc <enc>  (or SQUEAK_TEXTENC="<enc>")

    tells the VM which encoding to use when asking other applications
    for the selection (or pasteboard) contents.  The default is
    ISO-8859-1 on X11 (since that's the standard 8-bit text encoding
    for X applications).  The default is UTF-8 on MacOSX (for the same
    reason).  If you would like Squeak to ask other X11 apps for
    selections converted as UTF8_STRING then set
	SQUEAK_TEXTENC="UTF-8"
    but be warned that there are still _very_ few X11 apps that
    correctly honour such requests; Emacs in particular doesn't know
    what to do with them.  (Transferring UTF-8 text between two Unix
    Squeak VMs in this way [naturally] works just fine. ;)

    Note that setting SQUEAK_TEXTENC will not change the way Squeak
    _answers_ selection requests: if the requestor gives STRING as the
    target conversion type then it will get Latin1 encoded text; if it
    asks for UTF8_STRING then it will (correctly) get UTF-8 encoded
    text.
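
As a rough sketch of the answering rule just described (Python, not the VM's X11 code): the function and table names are hypothetical, a Python string stands in for the image's text, and I am assuming the UTF-8 target is the atom that X11 applications know as UTF8_STRING.

```python
# Target names as an X11 requestor would send them (assumed atom names).
TARGET_ENCODINGS = {
    "STRING": "iso-8859-1",
    "UTF8_STRING": "utf-8",
}

def answer_selection(text, target):
    """Return the bytes to hand to a requestor asking for `target`.

    The answer depends only on the requested target, never on
    SQUEAK_TEXTENC, which governs only what Squeak itself asks for.
    """
    encoding = TARGET_ENCODINGS.get(target)
    if encoding is None:
        return None  # unsupported target: refuse the conversion
    # Characters the target encoding cannot represent become '?'.
    return text.encode(encoding, errors="replace")

print(answer_selection("d\u00e9j\u00e0", "STRING"))
print(answer_selection("d\u00e9j\u00e0", "UTF8_STRING"))
```

Substituting '?' for characters that Latin1 cannot represent is my own choice for the sketch; what the VM does in that case is not covered here.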

Encoding names are not case-sensitive.
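
For anyone who wants the defaults in one place, here is a minimal Python sketch that collects the three settings from the environment.  The function name and the platform argument are hypothetical, and the sketch reads only the environment variables, since how the command-line switches combine with them is not covered above.

```python
import os

def encoding_settings(environ=os.environ, platform="x11"):
    """Collect the three encoding settings, falling back to the defaults."""
    # The selection encoding default differs between X11 and MacOSX.
    text_default = "UTF-8" if platform == "macosx" else "ISO-8859-1"
    return {
        "encoding": environ.get("SQUEAK_ENCODING", "MacRoman"),
        "pathenc": environ.get("SQUEAK_PATHENC", "UTF-8"),
        "textenc": environ.get("SQUEAK_TEXTENC", text_default),
    }

print(encoding_settings({"SQUEAK_ENCODING": "Latin9"}))
```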

Other improvements for MacOSX users include:

  - 8-bit chars in HFS+ paths now work correctly (comes for free with
    the PATHENC conversion, and was pretty much the itch I scratched
    to arrive at all of the above encoding madness ;)

  - the final few problems with international keyboards should be
    fixed (Squeak should respond to deadkeys exactly like all other
    applications)

  - a problem with Squeak failing to reactivate correctly when
    deminiaturising from the dock (requiring a click away from and
    then back in the Squeak window) should no longer occur.

The X11 display driver currently doesn't implement deadkeys or
multikey composition at all.  (I think I've figured out enough about
the X input method stuff to make this work, but it would be a
significant hassle.  If anybody really, _really_ wants this then let
me know and I'll do some experiments.)

Bon courage !

Ian


Note: the encoding names follow the IANA-registered character set
names.  The following are recognised on MacOSX (where I have to
provide a table to convert from a string name to an OS constant; names
on the same line are equivalent):

     MACROMAN MAC MACINTOSH CSMACINTOSH
     UTF8 UTF-8
     ISOLATIN9 LATIN9 ISO-8859-15
     ISOLATIN1 LATIN1 ISO-8859-1

Adding (lots of) others is trivial (but someone will have to prod me
to do it).
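
The kind of table involved can be sketched in a few lines of Python.  This is only an illustration: the real table is part of the VM and maps names to OS constants, whereas here the first name on each row stands in for the constant, and the identifiers are hypothetical.

```python
# Each row lists names that are equivalent, as in the table above.
ALIAS_ROWS = [
    ("MACROMAN", "MAC", "MACINTOSH", "CSMACINTOSH"),
    ("UTF8", "UTF-8"),
    ("ISOLATIN9", "LATIN9", "ISO-8859-15"),
    ("ISOLATIN1", "LATIN1", "ISO-8859-1"),
]

# Map every alias to the first name on its row (a stand-in for the OS constant).
CANONICAL = {alias: row[0] for row in ALIAS_ROWS for alias in row}

def lookup_encoding(name):
    """Return the canonical name for `name`, ignoring case, or None if unknown."""
    return CANONICAL.get(name.upper())

print(lookup_encoding("latin9"), lookup_encoding("Utf-8"))
```

Adding another encoding in this scheme is one more row, which is why extending the list is trivial.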

The X11 code uses the iconv(3) function that is built into most modern
versions of libc.  (GNU/Linux and BSD users got limited support in
glibc2.2 and much more complete support in glibc2.3.)  The VM
therefore recognises all registered coding systems (given a
sufficiently modern libc) including the entire Latin series, all the
MS codepages (to keep our antarctic friends happy) and even EBCDIC on
many systems.  If you have the 'iconv' program then the complete (very
long) list of supported encodings can be printed by running 'iconv -l'.
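
If you don't have 'iconv' to hand, the same kind of check can be tried from Python.  Note that this queries Python's own codec registry rather than iconv(3), so the set of names it accepts is Python's and only approximates what the VM will recognise on a given system; the helper name is hypothetical.

```python
import codecs

def is_supported(name):
    """Report whether Python's codec registry recognises an encoding name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

# A Latin-series encoding, an MS codepage and an EBCDIC variant.
candidates = ["ISO-8859-15", "CP1252", "CP037", "no-such-encoding"]
support = {name: is_supported(name) for name in candidates}

# In the EBCDIC codepage CP037 the letter A is not at its ASCII position.
ebcdic_a = "A".encode("cp037")

print(support, ebcdic_a.hex())
```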


