File names was Re: Warning: Large Babel translation

Yoshiki Ohshima Yoshiki.Ohshima at acm.org
Sat Nov 15 02:22:03 UTC 2003


  John,

> Right now, for example, French Canadian users who name files using
> certain accented characters will see them correctly in the Squeak
> file browser.

  Yes.

> When we migrate to UTF8 strings, certain accented characters will
> become multiple-byte sequences which won't render as expected.

  Yes.
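
  A small illustration of that effect, written in Python rather than
Squeak.  The codec names are Python's and the sample name is made up;
nothing here is taken from the VM.

```python
# Illustration only: a file name stored as UTF-8, then misread as MacRoman.
name = "café"
utf8_bytes = name.encode("utf-8")         # 'é' becomes two bytes: C3 A9
misread = utf8_bytes.decode("mac_roman")  # each byte shown as its own character

print(len(name), len(utf8_bytes))  # 4 characters, 5 bytes
print(misread)                     # caf√©
```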

> Also I'm not quite sure how they would type them in, because you are
> using a MacRoman character set to construct them,

  Ah, what do you mean by 'to construct'?  Do you mean that the VM
would pass keyboard input to the image in MacRoman when this proposed
new VM is used?

> but on, say, a file open, I don't really have sufficient information
> to decide if the incoming array of bytes is a String or UTF8.

  Are you talking about the file name, or the content of the file?
Those are completely different matters.  For file names, you can tell,
because the VM knows the OS and the OS version it is running on.  For
file contents, unless your file system allows you to attach attributes
to files, there is in general no way to tell the encoding of a file.
It is completely common to have differently-encoded text files in the
same file system.
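
  To sketch the difference (Python again; `file_name_encoding` and its
table are hypothetical, and the Mac OS 9 entry is an assumption for
Western systems, not something stated in this thread):

```python
def file_name_encoding(platform: str) -> str:
    """Hypothetical lookup: external file name encoding for a platform."""
    table = {
        "Mac OS X": "utf-8",      # per this thread: native names are UTF-8
        "Mac OS 9": "mac_roman",  # assumption: Western classic Mac OS
    }
    return table[platform]

# File contents are different: the same bytes decode without error under
# more than one encoding, so the bytes alone do not reveal which was meant.
data = b"caf\xe9"
as_latin1 = data.decode("latin-1")      # 'café'
as_macroman = data.decode("mac_roman")  # 'cafÈ'
print(file_name_encoding("Mac OS X"), as_latin1, as_macroman)
```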

> We could hunt that information down by looking at the class, however
> what to do about older images & newer VMs etc is a question for
> discussion. Maybe a different api for UTF8 file names? And perhaps a
> UTF8String class to render text based on current font?

  No.  The Squeak image should use a uniform (or mostly uniform)
*internal* representation.  The external representation is used only
when the image and VM need to communicate.
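
  Roughly, the idea is to convert only at the boundary.  A minimal
sketch in Python, where `from_vm` and `to_vm` are made-up names and
Python's `str` stands in for the image's internal representation:

```python
def from_vm(raw: bytes, external_encoding: str) -> str:
    """Bytes handed over by the VM become one uniform internal string type."""
    return raw.decode(external_encoding)

def to_vm(name: str, external_encoding: str) -> bytes:
    """The internal string is encoded only when it is handed back to the VM."""
    return name.encode(external_encoding)

raw_from_os = "résumé.txt".encode("utf-8")  # pretend the OS gave us this
name = from_vm(raw_from_os, "utf-8")        # internal form, encoding-agnostic
round_trip = to_vm(name, "utf-8")           # external form again
print(len(name), len(raw_from_os))          # characters vs. bytes
```

  Code inside the image then only ever sees `name`; no separate string
class per external encoding is needed.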

> >   I assume that 'VM returns and deals with UTF-8 string' means that
> > the VM doesn't do any translation on file names; e.g. a byte sequence
> > that represents a file name in the file system doesn't get translated
> > in dir_lookup() or anywhere above it, and is passed to the image as
> > is, right?
> 
> Right, for OS-X the native file/directory strings are UTF-8. Right now
> we carefully map back and forth between MacRoman

  For the file names, it does, yes.  And that isn't a nice thing,
because information will be lost, and cannot be restored at the image
level, if a name uses characters beyond 256.
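
  A sketch of that loss in Python.  The '?' substitution is Python's
`errors="replace"` behavior, used here as a stand-in; what the actual
VM does with unmappable characters is not specified in this thread.

```python
# Illustration only: a UTF-8 file name forced through MacRoman and back.
name = "日本語.txt"
native = name.encode("utf-8")

# Stand-in for the VM's UTF-8 -> MacRoman mapping; unmappable characters
# are replaced because MacRoman has no code for them.
mapped = native.decode("utf-8").encode("mac_roman", errors="replace")
recovered = mapped.decode("mac_roman")

print(mapped)             # b'???.txt'
print(recovered == name)  # False: the original name cannot be rebuilt
```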

  To ensure compatibility, we would need a flag to switch the VM's
behavior, I would imagine...
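
  Something along these lines, perhaps (Python sketch; `name_for_image`
and the `legacy_macroman` flag are hypothetical names, not an existing
VM interface):

```python
def name_for_image(native_utf8: bytes, legacy_macroman: bool) -> bytes:
    """Hypothetical switch between the old and the proposed VM behavior."""
    if legacy_macroman:
        # Old behavior: translate to MacRoman so older images keep working.
        return native_utf8.decode("utf-8").encode("mac_roman", errors="replace")
    # Proposed behavior: hand the native UTF-8 bytes over untouched.
    return native_utf8

native = "café".encode("utf-8")
old_style = name_for_image(native, legacy_macroman=True)   # b'caf\x8e'
new_style = name_for_image(native, legacy_macroman=False)  # same as native
print(old_style, new_style)
```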

-- Yoshiki




