Hi all! :-)


After some recent fun with the Unicode class, I found out that its data is quite out of date. For example, the comments do not even "know" that code points can be longer than 4 hex digits (i.e., lie beyond U+FFFF), and younger characters such as 😺❤🤓 are not categorized correctly. Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it is not completely automated yet and has some flaws, but in the meantime, I have one general question for you:


At the moment, we have 30 class variables, one for each Unicode general category. These class variables map, in alphabetical order, to the integers from 0 to 29. Is this tedious structure really necessary? For several purposes, I would like to get the category name of a specific code point from client code. The current design makes this impossible without writing additional mappings.
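
To illustrate what such an additional mapping looks like today, here is a rough workspace sketch. It assumes that the category class variables are exactly the two-letter entries in the class pool of Unicode (I have not double-checked this for every image version), and categoryNameOf is just a made-up name for the block:

```smalltalk
"Workspace sketch: recover the category name from the number that
#generalCategoryOf: answers today.
Assumption: the category class variables are exactly the two-letter
keys in the class pool of Unicode."
| categoryKeys categoryNameOf |
categoryKeys := Unicode classPool keys select: [:each | each size = 2].
categoryNameOf := [:charCode | | number |
	number := Unicode generalCategoryOf: charCode.
	"Reverse lookup: find the class variable that holds this number."
	categoryKeys
		detect: [:each | (Unicode classPool at: each) = number]
		ifNone: [nil]].
categoryNameOf value: $a asUnicode
```

Every client that wants the name has to carry around a reverse lookup like this, which is what I would like to get rid of.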

Tl;dr: I would like to propose dropping these class variables and using Symbols instead. They can be compared just like integers, and since they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with category names that are added later (though I don't know whether this will ever happen).


Examples:
Unicode generalTagOf: $a asUnicode. "#Ll"

Unicode class >> isLetterCode: charCode
  ^ (self generalTagOf: charCode) first = $L

Unicode class >> isAlphaNumericCode: charCode
  | tag |
  ^ (tag := self generalTagOf: charCode) first = $L
        or: [tag = #Nd]
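
For the deprecation of #generalCategoryOf:, a thin wrapper might be enough. This is only a sketch: it assumes #generalTagOf: exists as proposed, and GeneralCategoryNumbers is a made-up class variable (a Dictionary from each tag Symbol to the integer that the former class variable held):

```smalltalk
Unicode class >> generalCategoryOf: charCode
	"Sketch only. Keeps answering the old integer for existing senders.
	GeneralCategoryNumbers is hypothetical: a Dictionary that maps each
	tag Symbol (e.g. #Ll) to its former category number."
	self deprecated: 'Use #generalTagOf: instead'.
	^ GeneralCategoryNumbers at: (self generalTagOf: charCode)
```

Existing senders would keep working, and the numbers would live in a single place until the old selector can be removed.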

What do you think about this proposal? Please let me know and I will go ahead! :D

Best,
Christoph