[squeak-dev] Unicode

christoph.thiede at student.hpi.uni-potsdam.de christoph.thiede at student.hpi.uni-potsdam.de
Mon Feb 28 15:09:06 UTC 2022


Hi Marcel, thanks for the review! Below is an updated changeset. If you have no further objections, I would like to merge it within the next few days. :-)

=============== Summary ===============

Change Set:        UnicodeData
Date:            24 February 2022
Author:            Christoph Thiede

This changeset repairs the fetching & parsing of Unicode category data. Usage:

    Unicode reinitializeData.
    Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"

This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates #initializeUnicodeData into the class initializer, and adds some tests. Furthermore, the Unicode data are automatically reinitialized as part of the ReleaseBuilder.
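
As a minimal workspace sketch of how the new tag interface is meant to fit together (assuming the changeset has been filed in and the data have been initialized, which requires access to unicode.org), the selectors from the diff below can be combined like this. The temporary variable names are only illustrative:

```smalltalk
"Sketch only: assumes this changeset is loaded and Unicode reinitializeData has run."
| code tag label index |
code := $a asUnicode.

"Code point -> category tag (a Symbol such as #Ll)."
tag := Unicode generalCategoryTagOf: code.

"Category tag -> human-readable label."
label := Unicode generalCategoryLabelForTag: tag.

"Category tag -> zero-based index as stored in GeneralCategory.
#Ll is the sixth element of #allCategoryTags, hence index 5."
index := Unicode generalCategoryIndexFromTag: #Ll.
```

The integers remain the internal representation, as discussed earlier in this thread; the tags and labels are only a convenience layer for clients and tools.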

Remaining limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories

For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.html

=============== Postscript ===============

"Postscript:
Leave the line above, and replace the rest of this comment by a useful one.
Executable statements should follow this comment, and should
be separated by periods, with no exclamation points (!).
Be sure to put any further comments in double-quotes, like this one."


=============== Diff ===============

ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
prepareSourceCode
    "Update packages. Remove foreign packages. Recompile."

    CurrentReadOnlySourceFiles cacheDuring:
        [self
            updateCorePackages;
            unloadForeignPackages;
            checkForDirtyPackages;
            loadWellKnownPackages;
            checkForUndeclaredSymbols;
            checkForNilCategories;
-             recompileAll]
+             recompileAll;
+             updateDatabases]

ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
+ updateDatabases
+ 
+     Unicode reinitializeData.

Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
+ allCategoryTags
+ 
+     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)

Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
blocks320Comment
+ "http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt"

"# Blocks-3.2.0.txt
# Correlated with Unicode 3.2
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplementary
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1700..171F; Tagalog
1720..173F; Hanunoo
1740..175F; Buhid
1760..177F; Tagbanwa
1780..17FF; Khmer
1800..18AF; Mongolian
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Diacritical Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
27C0..27EF; Miscellaneous Mathematical Symbols-A
27F0..27FF; Supplemental Arrows-A
2800..28FF; Braille Patterns
2900..297F; Supplemental Arrows-B
2980..29FF; Miscellaneous Mathematical Symbols-B
2A00..2AFF; Supplemental Mathematical Operators
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use Area
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE00..FE0F; Variation Selectors
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFF; Arabic Presentation Forms-B
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFF; Specials
10300..1032F; Old Italic
10330..1034F; Gothic
10400..1044F; Deseret
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D400..1D7FF; Mathematical Alphanumeric Symbols
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
F0000..FFFFF; Supplementary Private Use Area-A
100000..10FFFF; Supplementary Private Use Area-B


"

Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
+ generalCategoryIndexFromTag: tag
+ 
+     ^ (self allCategoryTags indexOf: tag) - 1

Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
+ generalCategoryLabelForTag: tag
+ 
+     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1

Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
+ generalCategoryTagOf: aCharacterCode
+ 
+     ^ (self generalCategoryOf: aCharacterCode)
+         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+         ifNil: [#Cn]

Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
initialize
    " Unicode initialize "
    self initializeTagConstants.
-     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
+     
+     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].

Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
+ initializeUnicodeData
+     "self initializeUnicodeData"
+ 
+     self parseUnicodeDataFrom: self unicodeData readStream.

Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
parseUnicodeDataFrom: stream
- "
-     self halt.
-     self parseUnicodeDataFile
- "
+     "self initializeUnicodeData."

-     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
+     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |

    toNumber := [:quad | ('16r', quad) asNumber].

    GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue:  'Cn'.
    DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.

    16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
    16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].

    [(line := stream nextLine) size > 0] whileTrue: [
        fieldEnd := line indexOf: $; startingAt: 1.
        point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
        point > 16rE007F ifTrue: [
            GeneralCategory zapDefaultOnlyEntries.
            DecimalProperty zapDefaultOnlyEntries.
            ^ self].
        2 to: 3 do: [:i |
            fieldStart := fieldEnd + 1.
            fieldEnd := line indexOf: $; startingAt: fieldStart.
        ].
-         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
+         tag := line copyFrom: fieldStart to: fieldEnd - 1.
+         generalCategory := self generalCategoryIndexFromTag: tag.
        GeneralCategory at: point+1 put: generalCategory.
-         generalCategory = 'Nd' ifTrue: [
+         generalCategory = Nd ifTrue: [
            4 to: 7 do: [:i |
                fieldStart := fieldEnd + 1.
                fieldEnd := line indexOf: $; startingAt: fieldStart.
            ].
            decimalProperty :=  line copyFrom: fieldStart to: fieldEnd - 1.
            DecimalProperty at: point+1 put: decimalProperty asNumber.
        ].
    ].
    GeneralCategory zapDefaultOnlyEntries.
    DecimalProperty zapDefaultOnlyEntries.


Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
+ reinitializeData
+ 
+     Compositions := GeneralCategory := nil.
+     self initialize.

UnicodeTest
+ TestCase subclass: #UnicodeTest
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ 
+ UnicodeTest class 
+     instanceVariableNames: ''
+ 
+ ""

UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
+ resources
+ 
+      ^ super resources copyWith: UnicodeTestResource

UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryLabel
+ 
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
+     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
+     
+     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
+     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
+     
+     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).

UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
+ testGeneralCategoryLabelForTag
+ 
+     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).

UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
+ testGeneralCategoryTag
+ 
+     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
+     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
+     
+     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
+     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
+     
+     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).

UnicodeTestResource
+ TestResource subclass: #UnicodeTestResource
+     instanceVariableNames: ''
+     classVariableNames: ''
+     poolDictionaries: ''
+     category: 'MultilingualTests-Encodings'
+ 
+ UnicodeTestResource class 
+     instanceVariableNames: ''
+ 
+ ""

UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
+ setUp
+ 
+     super setUp.
+     
+     "Test the functionality of this update logic"
+     Unicode initializeCompositionMappings.
+     Unicode initializeUnicodeData.

---
Sent from Squeak Inbox Talk

On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:

> Hi Christoph --
> 
> Thanks for doing this!
> 
> >  Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
> 
> I think it is okay. We could make it explicit in the ReleaseBuilder or in some external CI script but having it part of the update stream is okay. The updates are fetched from the outside anyway, right? ;-)
> 
> Best;
> Marcel
> On 24.02.2022 22:16:42, christoph.thiede at student.hpi.uni-potsdam.de <christoph.thiede at student.hpi.uni-potsdam.de> wrote:
> Hi all, Hi Marcel, Hi Levente,
> 
> finally, here is a changeset that takes the first step toward updating our in-image Unicode database. After filing it in, please run:
> 
>     Unicode initializeUnicodeData.
> 
> In addition to the preamble (please read first below), I still have a number of questions:
> 
> * Is it okay to fetch the data from unicode.org via a postscript in the update stream? Hypothetically, some clients might use a proxy or a strict firewall/safelist of IP addresses.
> * How much effort shall we put in deduplicating the logic and the data in this class? This includes both the two similar parsing methods and the redundant specification of the Unicode character tags.
> 
> Best,
> Christoph
> 
> =============== Summary ===============
> 
> Change Set:        UnicodeData
> Date:            24 February 2022
> Author:            Christoph Thiede
> 
> This changeset repairs the fetching & parsing of Unicode category data. Usage:
> 
>     Unicode initializeUnicodeData.
>     Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"
> 
> This revision resolves some slips in the category tags, adds an interface for retrieving/converting tags, unifies the vocabulary of the Unicode protocol, integrates #initializeUnicodeData into the class initializer, and adds some tests.
> 
> Remaining limitations include:
> - Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
> - Redundant and scattered declaration of character categories
> 
> For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.html
> 
> =============== Postscript ===============
> 
> "Postscript:
> Leave the line above, and replace the rest of this comment by a useful one.
> Executable statements should follow this comment, and should
> be separated by periods, with no exclamation points (!).
> Be sure to put any further comments in double-quotes, like this one."
> 
> 
> =============== Diff ===============
> 
> Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
> + allCategoryTags
> +
> +     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
> 
> Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
> + generalCategoryIndexFromTag: tag
> +
> +     ^ (self allCategoryTags indexOf: tag) - 1
> 
> Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
> + generalCategoryLabelForTag: tag
> +
> +     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
> 
> Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
> + generalCategoryTagOf: aCharacterCode
> +
> +     ^ (self generalCategoryOf: aCharacterCode)
> +         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
> +         ifNil: [#Cn]
> 
> Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
> initialize
>     " Unicode initialize "
>     self initializeTagConstants.
> -     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
> +     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
> +     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
> 
> Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
> + initializeUnicodeData
> +     "self initializeUnicodeData"
> +
> +     self parseUnicodeDataFrom: self unicodeData readStream.
> 
> Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
> parseUnicodeDataFrom: stream
> - "
> -     self halt.
> -     self parseUnicodeDataFile
> - "
> +     "self initializeUnicodeData."
> 
> -     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
> +     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
> 
>     toNumber := [:quad | ('16r', quad) asNumber].
> 
>     GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
>     DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
> 
>     16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
>     16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
>     16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
> 
>     [(line := stream nextLine) size > 0] whileTrue: [
>         fieldEnd := line indexOf: $; startingAt: 1.
>         point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
>         point > 16rE007F ifTrue: [
>             GeneralCategory zapDefaultOnlyEntries.
>             DecimalProperty zapDefaultOnlyEntries.
>             ^ self].
>         2 to: 3 do: [:i |
>             fieldStart := fieldEnd + 1.
>             fieldEnd := line indexOf: $; startingAt: fieldStart.
>         ].
> -         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
> +         tag := line copyFrom: fieldStart to: fieldEnd - 1.
> +         generalCategory := self generalCategoryIndexFromTag: tag.
>         GeneralCategory at: point+1 put: generalCategory.
> -         generalCategory = 'Nd' ifTrue: [
> +         generalCategory = Nd ifTrue: [
>             4 to: 7 do: [:i |
>                 fieldStart := fieldEnd + 1.
>                 fieldEnd := line indexOf: $; startingAt: fieldStart.
>             ].
>             decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
>             DecimalProperty at: point+1 put: decimalProperty asNumber.
>         ].
>     ].
>     GeneralCategory zapDefaultOnlyEntries.
>     DecimalProperty zapDefaultOnlyEntries.
> 
> 
> UnicodeTest
> + TestCase subclass: #UnicodeTest
> +     instanceVariableNames: ''
> +     classVariableNames: ''
> +     poolDictionaries: ''
> +     category: 'MultilingualTests-Encodings'
> +
> + UnicodeTest class
> +     instanceVariableNames: ''
> +
> + ""
> 
> UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
> + resources
> +
> +      ^ super resources copyWith: UnicodeTestResource
> 
> UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
> + testGeneralCategoryLabel
> +
> +     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
> +     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
> +     
> +     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
> +     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
> +     
> +     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
> 
> UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
> + testGeneralCategoryLabelForTag
> +
> +     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
> 
> UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
> + testGeneralCategoryTag
> +
> +     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
> +     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
> +     
> +     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
> +     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
> +     
> +     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
> 
> UnicodeTestResource
> + TestResource subclass: #UnicodeTestResource
> +     instanceVariableNames: ''
> +     classVariableNames: ''
> +     poolDictionaries: ''
> +     category: 'MultilingualTests-Encodings'
> +
> + UnicodeTestResource class
> +     instanceVariableNames: ''
> +
> + ""
> 
> UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
> + setUp
> +
> +     super setUp.
> +     
> +     "Test the functionality of this update logic"
> +     Unicode initializeCompositionMappings.
> +     Unicode initializeUnicodeData.
> 
> ---
> Sent from Squeak Inbox Talk [https://github.com/hpi-swa-lab/squeak-inbox-talk]
> 
> On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
> 
> > Hi Christoph,
> >
> > On Wed, 9 Sep 2020, Thiede, Christoph wrote:
> >
> > >
> > > Hi Levente,
> > >
> > >
> > > basically, I would only like to get rid of the class variables for every single Unicode category because they provide low explorability (it's hard to work with the numeric output of #generalCategoryOf:) and extensibility (you need to recompile the class definition for adding UTF-16 support). If you are critical of increasing the size of the SparseLargeTable, I think we could also just make one or two extra dictionaries to map every category symbol to a number and vice versa. What do you think?
> >
> > You mean an array to map the integers to symbols, right? :)
> > Anyway, I don't think it's worth using symbols internally.
> > For example, #isLetterCode: is 8-10% slower with the extra array lookup
> > and checking the category symbol's first letter than the current method
> > of integer comparisons.
> >
> > Do you expect these constants to appear outside the Unicode class? If yes,
> > then using symbols for those cases is probably a good solution.
> > But for internal use, the integers are better.
> >
> >
> > Levente
> >
> > >
> > >
> > > Best,
> > >
> > > Christoph
> > >
> > >
> > > ________________________________
> > > From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Levente Uzonyi <leves at caesar.elte.hu>
> > > Sent: Tuesday, 8 September 2020 21:43:56
> > > To: The general-purpose Squeak developers list
> > > Subject: Re: [squeak-dev] Unicode
> > > Hi Christoph,
> > >
> > > On Tue, 8 Sep 2020, Thiede, Christoph wrote:
> > >
> > > >
> > > > Hi all,
> > > >
> > > >
> > > > > Your words suggest that it has already been published, but I can't find it anywhere.
> > > >
> > > >
> > > > Then I must have expressed myself wrong. I did not yet publish any code changes, but in my original post from March, you can find a short description of the design changes I'd like to implement. Essentially, I would like to
> > > > replace the separate class variables for every known character class in favor of greater flexibility.
> > >
> > > How would your changes affect GeneralCategory? Would it still be a
> > > SparseLargeTable with ByteArray as arrayClass?
> > > If you just replace those integers with symbols, the size of the table
> > > will be at least 4 or 8 times larger in 32 or 64 bit images,
> > > respectively.
> > >
> > >
> > > Levente.
> > >
> > > >
> > > > Eliot, are there any remaining questions regarding the VM size? Character size should be sufficient as discussed below, and of course, I can test any changes in a 32-bit image, too. :-)
> > > >
> > > > WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect: #asCharacter)
> > > >
> > > > Best,
> > > > Christoph
> > > >
> > > > ________________________________
> > > > From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on behalf of Tobias Pape <Das.Linux at gmx.de>
> > > > Sent: Sunday, 6 September 2020 21:00:14
> > > > To: The general-purpose Squeak developers list
> > > > Subject: Re: [squeak-dev] Unicode
> > > >
> > > > > On 06.09.2020, at 20:40, Levente Uzonyi <leves at caesar.elte.hu> wrote:
> > > > >
> > > > > On Sun, 6 Sep 2020, Tobias Pape wrote:
> > > > >
> > > > >>
> > > > >>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> > > > >>> Hi Christoph, Hi All,
> > > > >>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
> > > > >>>> Hi all! :-)
> > > > >>>> After some recent fun with the Unicode class, I found out that its data is quite out of date (for example, the comments do not even "know" that code points can be longer than 4 bytes. Younger characters such as ??? are not categorized correctly, etc. ...). Luckily, there is already some logic to fetch the latest data from www.unicode.org. I'm currently reworking this logic because it's not completely automated yet and has some slips, but so long, I have one general question for you:
> > > > >>> And consequently I have a couple of questions for you. In the Spur VM Characters are immediate (they are like SmallInteger and exist in oops (object-oriented pointers) as tagged values).  In the 32-bit variant Characters are 30-bit unsigned integers.  In the 64-bit variant they are also 30-bit unsigned integers, but could easily be extended to be up to 61-bit unsigned integers.
> > > > >>> Q1, can you arrange that the Unicode support does not break in initialization on the 32-bit variant?  It may be that the 32-bit variant cannot represent code points beyond 30 bits in size, but we should try to ensure that initialization still runs to completion even if it fails to initialize information relating to code points beyond 30 bits in size.
> > > > >>> Q2, how many bits should the 64-bit variant VM support for immediate Characters?
> > > > >>
> > > > >> Unicode has a max value of 0x10FFFF. That makes 21 bit. So no worries there.
> > > > >>
> > > > >> We should just not forget the leading-char stuff (Yoshiki, Andreas,...)
> > > > >
> > > > > AFAIU the leading char only makes sense when you have multiple CJK(V?) languages in use at the same time. In other cases Unicode (leadingChar = 0) is perfectly fine.
> > > > > IIRC there are 22 bits available for the codePoint and 8 for the leadingChar, so we're still good: all unicode characters fit.
> > > > >
> > > > >
> > > >
> > > > \o/ hooray!
> > > >
> > > > > Levente
> > > > >
> > > > >>
> > > > >>
> > > > >> Best regards
> > > > >>       -Tobias
> > > > >>
> > > > >>> Then something to consider is that it is conceptually possible to support something like WideCharacter, which would represent code points outside of the immediate Character range on the 32-bit variant, analogous to LargePositiveInteger beyond SmallInteger maxVal.  This can be made to work seamlessly, just as it does currently with integers, and with Floats where SmallFloat64 is only used on 64-bits.
> > > > >>> It has implications in a few parts of the system:
> > > > >>> - failure code for WideString (VeryWideString?) at:[put:] primitives that would have to manage overflow into/access from WideCharacter instances
> > > > >>> - ImageSegment and other (un)pickling systems that need to convert to/from a bit-specific "wire" protocol/representation
> > > > >>> - 32-bit <=> 64-bit image conversion
> > > > >>> All this is easily doable (because we have models of doing it for Float and Integer general instances).  But we need good specifications so we can implement the right thing from the get-go.
> > > > >>>> At the moment, we have 30 class variables each for one Unicode category number. These class vars map in alphabetical order to the integers from 0 to: 29. Is this tedious structure really necessary? For different purposes, I would like to get the category name of a specific code point from a client. The current design makes this impossible without writing additional mappings.
> > > > >>>> Tl;dr: I would like to propose to drop these class variables and use Symbols instead. They are comparable like integers, and as they are flyweights, this should not be a performance issue either. Of course, #generalCategoryOf: will have to keep returning numbers, but we could deprecate it and use a new #generalTagOf: in the future. Furthermore, this would also allow us to deal with later added category names (though I don't know whether this will ever happen).
> > > > >>>> Examples:
> > > > >>>> Unicode generalTagOf: $a asUnicode. "#Ll"
> > > > >>>> Unicode class >> isLetterCode: charCode
> > > > >>>>  ^ (self generalTagOf: charCode) first = $L
> > > > >>>> Unicode class >> isAlphaNumericCode: charCode
> > > > >>>>  | tag |
> > > > >>>>  ^ (tag := self generalTagOf: charCode) first = $L
> > > > >>>>        or: [tag = #Nd]
> > > > >>>> What do you think about this proposal? Please let me know and I will go ahead! :D
> > > > >>>> Best,
> > > > >>>> Christoph
> > > > >>> Best, Eliot
> > > > >>> _,,,^..^,,,_ (phone)
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> ["UnicodeData.2.cs"]
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: UnicodeData.3.cs
Type: application/octet-stream
Size: 10387 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20220228/2d10d9ef/attachment.obj>

