[squeak-dev] Unicode

17 Mar 2020


Hi all! :-)
After some recent fun with the Unicode class, I found out that its data is quite out of date. For example, the comments do not even "know" that code points can be longer than 4 hex digits, and younger characters such as 😺❤🤓 are not categorized correctly. Luckily, there is already some logic to fetch the latest data from http://www.unicode.org. I'm currently reworking this logic because it is not completely automated yet and has some slips, but in the meantime, I have one general question for you:
At the moment, we have 30 class variables, one for each Unicode general category. These class variables map in alphabetical order to the integers from 0 to 29. Is this tedious structure really necessary? For various purposes, I would like clients to be able to get the category name of a specific code point. The current design makes this impossible without writing additional mappings.
Tl;dr: I would like to propose dropping these class variables and using Symbols instead. Symbols can be compared just like integers, and since they are flyweights (each Symbol exists only once), this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with category names that are added later (though I don't know whether this will ever happen).
Examples:
Unicode generalTagOf: $a asUnicode. "#Ll"
Unicode class >> isLetterCode: charCode
  ^ (self generalTagOf: charCode) first = $L
Unicode class >> isAlphaNumericCode: charCode
  | tag |
  ^ (tag := self generalTagOf: charCode) first = $L
        or: [tag = #Nd]
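For illustration, here is a rough sketch of how #generalTagOf: could be layered on top of the existing numbers during a transition. It assumes that #generalCategoryOf: answers an integer from 0 to 29 and that the numbering follows the alphabetical order of the two-letter category names, as described above. If the image numbers the categories differently, the table would need to be reordered. The literal array is only a hypothetical stand-in for whatever lookup the reworked class ends up using.

```smalltalk
Unicode class >> generalTagOf: charCode
	"Hypothetical sketch, not existing Squeak code.
	Answer the general category of charCode as a Symbol such as #Ll or #Nd.
	Assumes generalCategoryOf: answers an integer from 0 to 29 that follows
	the alphabetical order of the category names."
	| tags |
	tags := #(#Cc #Cf #Cn #Co #Cs #Ll #Lm #Lo #Lt #Lu
		#Mc #Me #Mn #Nd #Nl #No #Pc #Pd #Pe #Pf
		#Pi #Po #Ps #Sc #Sk #Sm #So #Zl #Zp #Zs).
	"Category numbers start at 0, Smalltalk arrays at 1."
	^ tags at: (self generalCategoryOf: charCode) + 1
```

In the long run, the table would presumably store the Symbols directly, so that this extra indirection disappears.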
What do you think about this proposal? Please let me know and I will go ahead! :D
Best,
Christoph