[squeak-dev] Windows mapping from CF_UNICODE to Unicode

Fri Nov 18 22:05:23 UTC 2022

> On 18. Nov 2022, at 22:52, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> 
> Hi All,
> 
>     does anyone know how Windows maps Unicode text to the CF_UNICODE format used in the clipboard? It seems to me that CF_UNICODE might simply be two-byte characters, excluding any codes beyond 16rFFFF.  Is it in fact UTF-16?

Windows being Windows, this ought to be UTF-16. When MS adopted Unicode in the 90s, it was still "small" enough for 16 bits,
and was, in fact, UCS-2. It got "upgraded" to UTF-16 around Windows 2000.

See: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows

NOTE: UTF-16 has a lot of fun with "surrogate pairs", which make it possible to reach every code point beyond 16rFFFF, up to 16r10FFFF.
This is rather messy, and surrogate code points are invalid in UTF-8, go figure.
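
To make the surrogate-pair arithmetic concrete, here is a minimal Python sketch (Python only for illustration, it is not what the Squeak image or VM does). The function names are made up for this example; the constants are the standard UTF-16 surrogate ranges.

```python
def to_surrogate_pair(code_point):
    """Split a code point beyond 16rFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 lies outside the 16-bit range, so it needs two 16-bit code units.
high, low = to_surrogate_pair(0x1F600)

# Cross-check against Python's own little-endian UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-le")
```

A pure two-byte (UCS-2) reader would see these two code units as two separate, meaningless characters, which is the practical difference between UCS-2 and UTF-16.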

Side note: This is the reason why https://simonsapin.github.io/wtf-8/ exists.
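
A small Python sketch of the problem with unpaired surrogates, again only as an illustration. The helper names are hypothetical. Python's "surrogatepass" error handler is similar in spirit to WTF-8 in that it lets a lone surrogate through, but it is not a WTF-8 implementation.

```python
# An unpaired high surrogate, as can show up in sloppy UTF-16 data.
lone_high = "\ud83d"


def strict_utf8(text):
    """Encode as strict UTF-8; return None if the text contains surrogates."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


def lenient_utf8(text):
    """Encode with the 'surrogatepass' handler, which emits surrogates as 3-byte sequences."""
    return text.encode("utf-8", errors="surrogatepass")


rejected = strict_utf8(lone_high)    # strict UTF-8 refuses the lone surrogate
passed = lenient_utf8(lone_high)     # the lenient handler encodes it anyway
```

Here strict encoding fails while the lenient handler produces bytes that a strict UTF-8 decoder would reject in turn.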

Best regards
	-Tobias

> 
> If it is UTF-16 has anyone fixed our UTF16TextConverter?  I don't see any of the conveniences that exist for the UTF8TextConverter such as decodeString: etc.
> _,,,^..^,,,_
> best, Eliot
>