<div dir="ltr">This may be out of the scope of your project, but there is also the issue that Squeak/Pharo don&#39;t display most characters. Copy the following in a workspace and only the first line is rendered properly. At a minimum there should be font substitution happening when the current font doesn&#39;t contain the necessary glyphs.<div><br></div><div><div>Welcome </div><div>καλωσόρισμα</div><div>добро пожаловать</div><div>בברכה</div><div>أهلا بك</div><div>स्वागत</div><div>ยินดีต้อนรับ</div><div>ようこそ</div><div>欢迎</div><div>환영</div><div>❄</div><div>𝄞</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 4, 2015 at 3:42 AM, EuanM <span dir="ltr">&lt;<a href="mailto:euanmee@gmail.com" target="_blank">euanmee@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I&#39;m currently groping my way to seeing how feature-complete our<br>

Unicode support is.  I am doing this to establish what still needs to<br>

be done to provide full Unicode support.<br>

<br>

This seems to me to be an area where it would be best to write it<br>

once, and then have the same codebase incorporated into the Smalltalks<br>

that most share a common ancestry.<br>

<br>

I am keen to get: equality-testing for strings; sortability for<br>

strings which have ligatures and diacritic characters; and correct<br>

round-tripping of data.<br>

<br>

Call to action:<br>

==========<br>

<br>

If you have comments on these proposals - such as &quot;but we already have<br>

that facility&quot; or &quot;the reason we do not have these facilities is<br>

because they are dog-slow&quot; - please let me know them.<br>

<br>

If you would like to help out, please let me know.<br>

<br>

If you have Unicode experience and expertise, and would like to be, or<br>

would be willing to be, in the  &#39;council of experts&#39; for this project,<br>

please let me know.<br>

<br>

If you have comments or ideas on anything mentioned in this email, please let me know.<br>

<br>

In the first instance, the initiative&#39;s website will be:<br>

<a href="http://smalltalk.uk.to/unicode.html" rel="noreferrer" target="_blank">http://smalltalk.uk.to/unicode.html</a><br>

<br>

I have created a SqueakSource.com project called UnicodeSupport<br>

<br>

I want to avoid re-inventing any facilities which already exist.<br>

Except where they prevent us reaching the goals of:<br>

  - sortable UTF8 strings<br>

  - sortable UTF16 strings<br>

  - equivalence testing of 2 UTF8 strings<br>

  - equivalence testing of 2 UTF16 strings<br>

  - round-tripping UTF8 strings through Smalltalk<br>

  - round-tripping UTF16 strings through Smalltalk.<br>

As I understand it, we have limited Unicode support atm.<br>

<br>

Current state of play<br>

===============<br>

ByteString gets converted to WideString when need is automagically detected.<br>

<br>

Is there anything else that currently exists?<br>

<br>

Definition of Terms<br>

==============<br>

A quick definition of terms before I go any further:<br>

<br>

Standard terms from the Unicode standard<br>

===============================<br>

a compatibility character : an additional encoding of a *normal*<br>

character, for compatibility and round-trip conversion purposes.  For<br>

instance, a 1-byte encoding of a Latin character with a diacritic.<br>

<br>

Made-up terms<br>

============<br>

a convenience codepoint :  a single codepoint which represents an item<br>

that is also encoded as a string of codepoints.<br>

<br>

(I tend to use the terms compatibility character and compatibility<br>

codepoint interchangeably.  The standard only refers to them as<br>

compatibility characters.  However, the standard is determined to<br>

emphasise that characters are abstract and that codepoints are<br>

concrete.  So I think it is often more useful and productive to think<br>

of compatibility or convenience codepoints).<br>

<br>

a composed character :  a character made up of several codepoints<br>

<br>

Unicode encoding explained<br>

=====================<br>

A convenience codepoint can therefore be thought of as a code point<br>

used for a character which also has a composed form.<br>

<br>

The way Unicode works (in its UTF-8 encoding) is that sometimes you can encode a character in<br>

one byte, sometimes not.  Sometimes you can encode it in two bytes,<br>

sometimes not.<br>

<br>

You can therefore have a long stream of ASCII which is single-byte<br>

Unicode.  If there is an occasional Cyrillic or Greek character in the<br>

stream, it would be represented either by a compatibility character or<br>

by a multi-byte combination.<br>
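<br>

As a rough illustration of the variable-width point above, here is a minimal Python sketch (Python rather than Smalltalk only because its standard library already ships the codecs). The helper name utf8_width is made up for this example; it simply counts the bytes UTF-8 uses for a single code point.<br>

```python
# Hypothetical helper for illustration; not part of any library.
def utf8_width(character):
    """Return the number of bytes UTF-8 uses to encode one code point."""
    return len(character.encode("utf-8"))

samples = {
    "W": "ASCII letter",
    "\u03ba": "Greek small letter kappa",
    "\u0434": "Cyrillic small letter de",
    "\u6b22": "CJK ideograph",
    "\U0001d11e": "musical symbol G clef",
}

# Map each sample character to its encoded width in bytes.
widths = {ch: utf8_width(ch) for ch in samples}
for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {widths[ch]} byte(s)")
```

In UTF-8, ASCII stays at one byte, the Greek and Cyrillic letters here take two, the CJK ideograph three, and the G clef four; no code point needs more than four bytes.<br>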

<br>

Using compatibility characters can prevent proper sorting and<br>

equivalence testing.<br>

<br>

Using &quot;pure&quot; Unicode, i.e. &quot;normal encodings&quot;, can cause compatibility<br>

and round-tripping problems.  Although avoiding them can *also* cause<br>

compatibility issues and round-tripping problems.<br>
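<br>

The equivalence problem described above can be sketched in a few lines of Python, assuming that a single precomposed code point plays the role of a convenience codepoint and that Unicode Normalization Form C (NFC) is an acceptable common form. The function name canonically_equal is hypothetical.<br>

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

def canonically_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Code-point-wise comparison treats the two spellings as different.
naive = precomposed == combining
# Comparison after normalisation treats them as the same text.
normalised = canonically_equal(precomposed, combining)
print(naive, normalised)
```

The naive comparison reports the two spellings as different, while the normalised comparison reports them as equal, which is the behaviour an equivalence test on strings would need.<br>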

<br>

Currently my thinking is:<br>

<br>

a Utf8String class<br>

an Ordered collection, with 1 byte characters as the modal element,<br>

but short arrays of wider strings where necessary<br>

a Utf16String class<br>

an Ordered collection, with 2 byte characters as the modal element,<br>

but short arrays of wider strings<br>

beginning with a 2-byte endianness indicator.<br>
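<br>

For the UTF-16 side, a small Python sketch of the two details mentioned above: most characters fit in one 2-byte unit, some need two units, and a 2-byte byte-order mark can signal endianness. The helper utf16_units is an invented name, and big-endian order is chosen arbitrarily for the example.<br>

```python
import codecs

def utf16_units(text):
    """Number of 16-bit code units needed for text (no byte-order mark)."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u03ba"         # inside the Basic Multilingual Plane
astral_char = "\U0001d11e"  # outside it, so it needs a surrogate pair

# Explicit big-endian encoding, with the byte-order mark prepended by hand.
with_bom = codecs.BOM_UTF16_BE + bmp_char.encode("utf-16-be")

print(utf16_units(bmp_char), utf16_units(astral_char))
print(with_bom.hex())
```

The kappa needs one 16-bit unit and the G clef needs two, so a Utf16String would still have to cope with elements wider than its modal 2 bytes.<br>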

<br>

Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.<br>

<br>

So my thinking is that Utf8String will contain convenience codepoints,<br>

for round-tripping.  And where there are multiple convenience<br>

codepoints for a character, that it standardises on one.<br>

<br>

And that there is a Utf8SortableString which uses *only* normal characters.<br>

<br>

We then need methods to convert between the two.<br>

<br>

aUtf8String asUtf8SortableString<br>

<br>

and<br>

<br>

aUtf8SortableString asUtf8String<br>
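<br>

One possible reading of the two conversion methods, sketched in Python: it assumes the sortable form corresponds to full decomposition (NFD) and the compatible form to composition (NFC). That mapping is an assumption made for illustration, and as_sortable and as_composed are stand-in names for the proposed selectors.<br>

```python
import unicodedata

def as_sortable(text):
    """Stand-in for asUtf8SortableString: decompose into base + marks."""
    return unicodedata.normalize("NFD", text)

def as_composed(text):
    """Stand-in for asUtf8String: recompose into single code points."""
    return unicodedata.normalize("NFC", text)

word = "caf\u00e9"
sortable = as_sortable(word)
code_points = [f"U+{ord(c):04X}" for c in sortable]
round_tripped = as_composed(sortable)

print(code_points)
print(round_tripped == word)
```

For a string that is already in composed form, converting to the sortable form and back returns the original string.<br>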

<br>

<br>

Sort orders are culture and context dependent - Sweden and Germany<br>

have different sort orders for the same diacritic-ed characters.  Some<br>

countries have one order in general usage, and another for specific<br>

usages, such as phone directories (e.g. UK and France)<br>

<br>

Similarly for Utf16 :  Utf16String and Utf16SortableString and<br>

conversion methods<br>

<br>

A list of sorted words would be a SortedCollection, and there could be<br>

pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,<br>

seOrder, ukOrder, etc<br>

<br>

along the lines of<br>

aListOfWords := SortedCollection sortBlock: deOrder<br>
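<br>

To make the idea of pre-prepared sort blocks concrete, here is a deliberately simplified Python sketch with two key functions named after the proposed deOrder and seOrder. The rules are hand-written approximations (German dictionary order folds umlauts into their base letter; Swedish places a-ring, a-umlaut and o-umlaut after z), not real locale data, which would normally come from a collation library.<br>

```python
import unicodedata

def de_order(word):
    """German dictionary-style key: umlauts sort with their base letter."""
    folded = word.lower().replace("\u00df", "ss")
    decomposed = unicodedata.normalize("NFD", folded)
    # Drop combining marks so that a-umlaut compares like plain a.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def se_order(word):
    """Swedish key: a-ring, a-umlaut and o-umlaut are letters after z."""
    composed = unicodedata.normalize("NFC", word.lower())
    # Characters outside the alphabet get -1 and therefore sort first.
    return [SWEDISH_ALPHABET.find(c) for c in composed]

words = ["zon", "\u00e4ra", "arm", "\u00f6ra"]
german = sorted(words, key=de_order)
swedish = sorted(words, key=se_order)
print(german)
print(swedish)
```

The same four words come out in a different order under each key, which is why the sort block has to be chosen per culture and context.<br>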

<br>

If a word is either a Utf8SortableString, or a well-formed Utf8String,<br>

then we can perform equivalence testing on them trivially.<br>

<br>

To make sure a Utf8String is well formed, we would need to have a way<br>

of cleaning up any convenience codepoints which were valid, but which<br>

were for a character which has multiple equally-valid alternative<br>

convenience codepoints, and for which the string currently had the<br>

&quot;wrong&quot; convenience codepoint.  (i.e. for any character with valid<br>

alternative convenience codepoints, we would choose one to be in the<br>

well-formed Utf8String, and we would need a method for cleaning the<br>

alternative convenience codepoints out of the string, and replacing<br>

them with the chosen approved convenience codepoint.)<br>

<br>

aUtf8String cleanUtf8String<br>
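<br>

A sketch of what such a clean-up could look like, in Python and under the assumption that the approved codepoint is whatever Normalization Form C selects. The name clean_string is a stand-in for the proposed cleanUtf8String. The Angstrom sign and the Ohm sign are two characters for which Unicode has an alternative single code point.<br>

```python
import unicodedata

def clean_string(text):
    """Stand-in for cleanUtf8String: map alternatives to the NFC choice."""
    return unicodedata.normalize("NFC", text)

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
ohm_sign = "\u2126"       # OHM SIGN
omega = "\u03a9"          # GREEK CAPITAL LETTER OMEGA

cleaned = clean_string(angstrom_sign + ohm_sign)
print([f"U+{ord(c):04X}" for c in cleaned])
```

After cleaning, the string contains the letter A with ring and the Greek capital omega instead of the two sign characters, and cleaning it a second time changes nothing.<br>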

<br>

With WideString, a lot of the issues disappear - except<br>

round-tripping (although I&#39;m sure I have seen something recently about<br>

4-byte strings that also have an additional bit.  Which would make<br>

some Unicode characters 5-bytes long.)<br>

<br>

<br>

(I&#39;m starting to zone out now - if I&#39;ve overlooked anything - obvious,<br>

subtle, or somewhere in between, please let me know)<br>

<br>

Cheers,<br>

    Euan<br>

<br>

</blockquote></div><br></div></div>