[squeak-dev] Unicode

Tue Mar 17 22:51:09 UTC 2020

Hi all! :-)

After some recent fun with the Unicode class, I found out that its data is quite out of date. For example, the comments do not even "know" that code points can be longer than 4 hex digits, and younger characters such as 😺❤🤓 are not categorized correctly. Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it is not completely automated yet and has some glitches, but in the meantime, I have one general question for you:

At the moment, we have 30 class variables, one for each Unicode general category. These class vars map, in alphabetical order, to the integers from 0 to 29. Is this tedious structure really necessary? For various purposes, I would like to get the category name of a specific code point as a client. The current design makes this impossible without writing additional mappings.

Tl;dr: I would like to propose dropping these class variables and using Symbols instead. They can be compared as cheaply as integers, and since they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with category names that are added later (though I don't know whether this will ever happen).

Examples:
Unicode generalTagOf: $a asUnicode. "#Ll"

Unicode class >> isLetterCode: charCode
  ^ (self generalTagOf: charCode) first = $L

Unicode class >> isAlphaNumericCode: charCode
  | tag |
  "Letters (any L* tag) or decimal digits (#Nd)"
  ^ (tag := self generalTagOf: charCode) first = $L
        or: [tag = #Nd]
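
During a transition, #generalTagOf: could perhaps be layered on top of the existing integer lookup. This is only a rough sketch, not a finished implementation: it assumes that #generalCategoryOf: always answers an integer from 0 to 29 and that these integers follow the alphabetical order of the category names, as described above. The literal array below is my own illustration and would need to be checked against the actual class variables:

Unicode class >> generalTagOf: charCode
  "Sketch only. Assumes #generalCategoryOf: answers 0..29 in alphabetical category order."
  ^ #(#Cc #Cf #Cn #Co #Cs #Ll #Lm #Lo #Lt #Lu
      #Mc #Me #Mn #Nd #Nl #No
      #Pc #Pd #Pe #Pf #Pi #Po #Ps
      #Sc #Sk #Sm #So #Zl #Zp #Zs)
        at: (self generalCategoryOf: charCode) + 1

In the long run, the table could store the Symbols directly, so that no such mapping is needed at all.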

What do you think about this proposal? Please let me know and I will go ahead! :D

Best,
Christoph