[squeak-dev] Unicode

Jakob Reschke jakres+squeak at gmail.com
Tue Apr 5 12:40:47 UTC 2022


Would it be possible/practical to separate this?
* Download and transform into code/objects
* Distribute the generated code/objects via the update stream directly
One could run the download&generate step as needed to update the data. (CI,
release build, manually)
Or are there any reasons not to do that?

I was asking myself the same thing recently about the package that provides
time zone information. It needs a Unix timezone database from the operating
system to initialize, rather than providing Smalltalk objects/code directly
in Monticello, based on the official online database.
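
A minimal sketch of what the "transform into code" step could look like,
assuming the UnicodeData.txt field layout that #parseUnicodeDataFrom:
already relies on (code point first, general category third). The selector
#generatedCategoryData and the three sample lines are made up for
illustration; nothing like this exists in the image:

```smalltalk
"Hypothetical generator: turn UnicodeData.txt lines into the source of a
method that answers a literal array of code point / category tag pairs."
| sample entries source |
sample := #(
	'0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'
	'0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041'
	'0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;').
entries := sample collect: [:line | | fields |
	"The first three fields are never empty, so #subStrings: is safe here."
	fields := line subStrings: ';'.
	{('16r', fields first) asNumber. fields third}].
source := String streamContents: [:stream |
	stream nextPutAll: 'generatedCategoryData'; cr; tab; nextPutAll: '^ #('.
	entries do: [:each |
		stream print: each first; space; nextPutAll: each last; space].
	stream nextPut: $)].
source.
```

A CI or release build could then compile the string with something like
"Unicode class compile: source classified: 'unicode data - generated'", so
that the result is versioned and distributed like any other method.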

Kind regards,
Jakob

On Tue, Apr 5, 2022 at 08:59, Marcel Taeumel <
marcel.taeumel at hpi.de> wrote:

> Hi Eliot, hi Christoph --
>
> > Unicode reinitializeData.
>
> I think that this method has an unfortunate name. Since it downloads data
> from the Internet, it should be called #downloadAndInitializeData.
>
> And that's the reason for it not being in the post-load script. We might
> put the raw info there, but it would be very surprising if "Update Squeak"
> fetches data other than from source.squeak.org.
>
> Best,
> Marcel
>
> On 04.04.2022 at 21:17:17, Thiede, Christoph <
> christoph.thiede at student.hpi.uni-potsdam.de> wrote:
>
> > If this is essential
>
>
> Well, how do you define essential? You can still use your image with the
> old font definitions. However, for some newer codepoints such as 😁😎😍, Unicode
> generalCategoryLabelOf: and friends will answer "not assigned" without the
> upgrade. You can watch the difference by browsing any comprehensive font in
> the FontImporter. But I am not aware of any code path that relies on the
> presence of newer Unicode data.
>
>
> Apart from that, I was already discussing with Marcel what the
> consequences of downloading data from a third-party server during an image
> update would be. There might be some images, most likely server images, that
> do not have free internet access due to a strict firewall. Hypothetically,
> this might even introduce security issues. So in the end, we decided on
> leaving this optional for now. It will only break if any future patch of
> any package relies on exact Unicode data.
>
>
> Best,
>
> Christoph
> ------------------------------
> *From:* Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org> on
> behalf of Eliot Miranda <eliot.miranda at gmail.com>
> *Sent:* Monday, 4 April 2022 20:54:59
> *To:* The general-purpose Squeak developers list
> *Subject:* Re: [squeak-dev] Unicode
>
>
>
> On Apr 4, 2022, at 11:20 AM, Christoph.Thiede at student.hpi.uni-potsdam.de
> wrote:
>
> Merged via Multilingual-ct.271, Multilingual-ct.272,
> MultilingualTests-ct.41, and ReleaseBuilder-ct.231.
>
> Please run the following in your image to install the new Unicode data
> (and to uncover any regressions I may have missed :D):
>
>     Unicode reinitializeData.
>
>
> If this is essential then it *must* be added as a post load script to one
> (or more) of the relevant packages.  Asking “did you run Unicode reinitializeData?”
> when someone reports a strange bug isn’t acceptable.
>
>
> Best,
> Christoph
>
> ---
> *Sent from **Squeak Inbox Talk
> <https://github.com/hpi-swa-lab/squeak-inbox-talk>*
>
> On 2022-02-28T16:09:06+01:00, christoph.thiede at student.hpi.uni-potsdam.de
> wrote:
>
> > Hi Marcel, thanks for the review! Below is an updated changeset. If you
> have no further objections, I would like to merge it within the next few
> days. :-)
> >
> > =============== Summary ===============
> >
> > Change Set:    UnicodeData
> > Date:          24 February 2022
> > Author:        Christoph Thiede
> >
> > This changeset repairs the fetching & parsing of unicode category data.
> Usage:
> >
> >     Unicode reinitializeData.
> >     Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
> >
> > This revision resolves some slips in the category tags, adds an
> interface for retrieving/converting tags, unifies the vocabulary of the
> Unicode protocol, integrates the #initializeUnicodeData into the class
> initializer, and adds some tests. Furthermore, the Unicode data are
> automatically reinitialized as part of the ReleaseBuilder.
> >
> > Still present limitations include:
> > - Duplication between #parseUnicodeDataFrom: and
> #parseCompositionMappingFrom:
> > - Redundant and scattered declaration of character categories
> >
> > For more information, see:
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.html
> >
> > =============== Postscript ===============
> >
> > "Postscript:
> > Leave the line above, and replace the rest of this comment by a useful
> one.
> > Executable statements should follow this comment, and should
> > be separated by periods, with no exclamation points (!).
> > Be sure to put any further comments in double-quotes, like this one."
> >
> >
> > =============== Diff ===============
> >
> > ReleaseBuilder class>>prepareSourceCode {preparing} · ct 2/28/2022 15:54 (changed)
> > prepareSourceCode
> >     "Update packages. Remove foreign packages. Recompile."
> >
> >     CurrentReadOnlySourceFiles cacheDuring:
> >         [self
> >             updateCorePackages;
> >             unloadForeignPackages;
> >             checkForDirtyPackages;
> >             loadWellKnownPackages;
> >             checkForUndeclaredSymbols;
> >             checkForNilCategories;
> > -             recompileAll]
> > +             recompileAll;
> > +             updateDatabases]
> >
> > ReleaseBuilder class>>updateDatabases {scripts - support} · ct 2/28/2022 16:06
> > + updateDatabases
> > +
> > +     Unicode reinitializeData.
> >
> > Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
> > + allCategoryTags
> > +
> > +     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
> >
> > Unicode class>>blocks320Comment {comments} · ct 2/28/2022 15:50 (changed)
> > blocks320Comment
> > + "http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt"
> >
> > "# Blocks-3.2.0.txt
> > # Correlated with Unicode 3.2
> > # Start Code..End Code; Block Name
> > 0000..007F; Basic Latin
> > 0080..00FF; Latin-1 Supplement
> > 0100..017F; Latin Extended-A
> > 0180..024F; Latin Extended-B
> > 0250..02AF; IPA Extensions
> > 02B0..02FF; Spacing Modifier Letters
> > 0300..036F; Combining Diacritical Marks
> > 0370..03FF; Greek and Coptic
> > 0400..04FF; Cyrillic
> > 0500..052F; Cyrillic Supplementary
> > 0530..058F; Armenian
> > 0590..05FF; Hebrew
> > 0600..06FF; Arabic
> > 0700..074F; Syriac
> > 0780..07BF; Thaana
> > 0900..097F; Devanagari
> > 0980..09FF; Bengali
> > 0A00..0A7F; Gurmukhi
> > 0A80..0AFF; Gujarati
> > 0B00..0B7F; Oriya
> > 0B80..0BFF; Tamil
> > 0C00..0C7F; Telugu
> > 0C80..0CFF; Kannada
> > 0D00..0D7F; Malayalam
> > 0D80..0DFF; Sinhala
> > 0E00..0E7F; Thai
> > 0E80..0EFF; Lao
> > 0F00..0FFF; Tibetan
> > 1000..109F; Myanmar
> > 10A0..10FF; Georgian
> > 1100..11FF; Hangul Jamo
> > 1200..137F; Ethiopic
> > 13A0..13FF; Cherokee
> > 1400..167F; Unified Canadian Aboriginal Syllabics
> > 1680..169F; Ogham
> > 16A0..16FF; Runic
> > 1700..171F; Tagalog
> > 1720..173F; Hanunoo
> > 1740..175F; Buhid
> > 1760..177F; Tagbanwa
> > 1780..17FF; Khmer
> > 1800..18AF; Mongolian
> > 1E00..1EFF; Latin Extended Additional
> > 1F00..1FFF; Greek Extended
> > 2000..206F; General Punctuation
> > 2070..209F; Superscripts and Subscripts
> > 20A0..20CF; Currency Symbols
> > 20D0..20FF; Combining Diacritical Marks for Symbols
> > 2100..214F; Letterlike Symbols
> > 2150..218F; Number Forms
> > 2190..21FF; Arrows
> > 2200..22FF; Mathematical Operators
> > 2300..23FF; Miscellaneous Technical
> > 2400..243F; Control Pictures
> > 2440..245F; Optical Character Recognition
> > 2460..24FF; Enclosed Alphanumerics
> > 2500..257F; Box Drawing
> > 2580..259F; Block Elements
> > 25A0..25FF; Geometric Shapes
> > 2600..26FF; Miscellaneous Symbols
> > 2700..27BF; Dingbats
> > 27C0..27EF; Miscellaneous Mathematical Symbols-A
> > 27F0..27FF; Supplemental Arrows-A
> > 2800..28FF; Braille Patterns
> > 2900..297F; Supplemental Arrows-B
> > 2980..29FF; Miscellaneous Mathematical Symbols-B
> > 2A00..2AFF; Supplemental Mathematical Operators
> > 2E80..2EFF; CJK Radicals Supplement
> > 2F00..2FDF; Kangxi Radicals
> > 2FF0..2FFF; Ideographic Description Characters
> > 3000..303F; CJK Symbols and Punctuation
> > 3040..309F; Hiragana
> > 30A0..30FF; Katakana
> > 3100..312F; Bopomofo
> > 3130..318F; Hangul Compatibility Jamo
> > 3190..319F; Kanbun
> > 31A0..31BF; Bopomofo Extended
> > 31F0..31FF; Katakana Phonetic Extensions
> > 3200..32FF; Enclosed CJK Letters and Months
> > 3300..33FF; CJK Compatibility
> > 3400..4DBF; CJK Unified Ideographs Extension A
> > 4E00..9FFF; CJK Unified Ideographs
> > A000..A48F; Yi Syllables
> > A490..A4CF; Yi Radicals
> > AC00..D7AF; Hangul Syllables
> > D800..DB7F; High Surrogates
> > DB80..DBFF; High Private Use Surrogates
> > DC00..DFFF; Low Surrogates
> > E000..F8FF; Private Use Area
> > F900..FAFF; CJK Compatibility Ideographs
> > FB00..FB4F; Alphabetic Presentation Forms
> > FB50..FDFF; Arabic Presentation Forms-A
> > FE00..FE0F; Variation Selectors
> > FE20..FE2F; Combining Half Marks
> > FE30..FE4F; CJK Compatibility Forms
> > FE50..FE6F; Small Form Variants
> > FE70..FEFF; Arabic Presentation Forms-B
> > FF00..FFEF; Halfwidth and Fullwidth Forms
> > FFF0..FFFF; Specials
> > 10300..1032F; Old Italic
> > 10330..1034F; Gothic
> > 10400..1044F; Deseret
> > 1D000..1D0FF; Byzantine Musical Symbols
> > 1D100..1D1FF; Musical Symbols
> > 1D400..1D7FF; Mathematical Alphanumeric Symbols
> > 20000..2A6DF; CJK Unified Ideographs Extension B
> > 2F800..2FA1F; CJK Compatibility Ideographs Supplement
> > E0000..E007F; Tags
> > F0000..FFFFF; Supplementary Private Use Area-A
> > 100000..10FFFF; Supplementary Private Use Area-B
> >
> >
> > "
> >
> > Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
> > + generalCategoryIndexFromTag: tag
> > +
> > +     ^ (self allCategoryTags indexOf: tag) - 1
> >
> > Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
> > + generalCategoryLabelForTag: tag
> > +
> > +     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
> >
> > Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
> > + generalCategoryTagOf: aCharacterCode
> > +
> > +     ^ (self generalCategoryOf: aCharacterCode)
> > +         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
> > +         ifNil: [#Cn]
> >
> > Unicode class>>initialize {class initialization} · ct 2/28/2022 15:52 (changed)
> > initialize
> >     " Unicode initialize "
> >     self initializeTagConstants.
> > -     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
> > +
> > +     self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
> > +     Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
> > +     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
> >
> > Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
> > + initializeUnicodeData
> > +     "self initializeUnicodeData"
> > +
> > +     self parseUnicodeDataFrom: self unicodeData readStream.
> >
> > Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
> > parseUnicodeDataFrom: stream
> > - "
> > -     self halt.
> > -     self parseUnicodeDataFile
> > - "
> > +     "self initializeUnicodeData."
> >
> > -     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
> > +     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
> >
> >     toNumber := [:quad | ('16r', quad) asNumber].
> >
> >     GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
> >     DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
> >
> >     16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
> >     16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
> >     16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
> >
> >     [(line := stream nextLine) size > 0] whileTrue: [
> >         fieldEnd := line indexOf: $; startingAt: 1.
> >         point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
> >         point > 16rE007F ifTrue: [
> >             GeneralCategory zapDefaultOnlyEntries.
> >             DecimalProperty zapDefaultOnlyEntries.
> >             ^ self].
> >         2 to: 3 do: [:i |
> >             fieldStart := fieldEnd + 1.
> >             fieldEnd := line indexOf: $; startingAt: fieldStart.
> >         ].
> > -         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
> > +         tag := line copyFrom: fieldStart to: fieldEnd - 1.
> > +         generalCategory := self generalCategoryIndexFromTag: tag.
> >         GeneralCategory at: point+1 put: generalCategory.
> > -         generalCategory = 'Nd' ifTrue: [
> > +         generalCategory = Nd ifTrue: [
> >             4 to: 7 do: [:i |
> >                 fieldStart := fieldEnd + 1.
> >                 fieldEnd := line indexOf: $; startingAt: fieldStart.
> >             ].
> >             decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
> >             DecimalProperty at: point+1 put: decimalProperty asNumber.
> >         ].
> >     ].
> >     GeneralCategory zapDefaultOnlyEntries.
> >     DecimalProperty zapDefaultOnlyEntries.
> >
> >
> > Unicode class>>reinitializeData {class initialization} · ct 2/28/2022 16:05
> > + reinitializeData
> > +
> > +     Compositions := GeneralCategory := nil.
> > +     self initialize.
> >
> > UnicodeTest
> > + TestCase subclass: #UnicodeTest
> > +     instanceVariableNames: ''
> > +     classVariableNames: ''
> > +     poolDictionaries: ''
> > +     category: 'MultilingualTests-Encodings'
> > +
> > + UnicodeTest class
> > +     instanceVariableNames: ''
> > +
> > + ""
> >
> > UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
> > + resources
> > +
> > +     ^ super resources copyWith: UnicodeTestResource
> >
> > UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
> > + testGeneralCategoryLabel
> > +
> > +     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
> > +     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
> > +
> > +     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
> > +     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
> > +
> > +     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
> >
> > UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
> > + testGeneralCategoryLabelForTag
> > +
> > +     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
> >
> > UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
> > + testGeneralCategoryTag
> > +
> > +     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
> > +     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
> > +
> > +     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
> > +     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
> > +
> > +     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
> >
> > UnicodeTestResource
> > + TestResource subclass: #UnicodeTestResource
> > +     instanceVariableNames: ''
> > +     classVariableNames: ''
> > +     poolDictionaries: ''
> > +     category: 'MultilingualTests-Encodings'
> > +
> > + UnicodeTestResource class
> > +     instanceVariableNames: ''
> > +
> > + ""
> >
> > UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
> > + setUp
> > +
> > +     super setUp.
> > +
> > +     "Test the functionality of this update logic"
> > +     Unicode initializeCompositionMappings.
> > +     Unicode initializeUnicodeData.
> >
> > ---
> > Sent from Squeak Inbox Talk
> >
> > On 2022-02-25T11:36:53+01:00, marcel.taeumel at hpi.de wrote:
> >
> > > Hi Christoph --
> > >
> > > Thanks for doing this!
> > >
> > > > Is it okay to fetch the data from unicode.org via a postscript in
> > > > the update stream? Hypothetically, some clients might use a proxy or a
> > > > strict firewall/safelist of IP addresses.
> > >
> > > I think it is okay. We could make it explicit in the ReleaseBuilder or
> in some external CI script but having it part of the update stream is okay.
> The updates are fetched from the outside anyway, right? ;-)
> > >
> > > Best,
> > > Marcel
> > > On 24.02.2022 at 22:16:42, christoph.thiede at
> > > student.hpi.uni-potsdam.de wrote:
> > > Hi all, Hi Marcel, Hi Levente,
> > >
> > > finally, here is a changeset that takes the first step towards updating
> > > our in-image Unicode database. After filing it in, please run:
> > >
> > > ????Unicode initializeUnicodeData.
> > >
> > > In addition to the preamble (please read it first, below), I still have
> > > a number of questions:
> > >
> > > * Is it okay to fetch the data from unicode.org via a postscript in
> the update stream? Hypothetically, some clients might use a proxy or a
> strict firewall/safelist of IP addresses.
> > > * How much effort shall we put in deduplicating the logic and the data
> in this class? This includes both the two similar parsing methods and the
> redundant specification of the Unicode character tags.
> > >
> > > Best,
> > > Christoph
> > >
> > > =============== Summary ===============
> > >
> > > Change Set:    UnicodeData
> > > Date:          24 February 2022
> > > Author:        Christoph Thiede
> > >
> > > This changeset repairs the fetching & parsing of unicode category
> data. Usage:
> > >
> > >     Unicode initializeUnicodeData.
> > >     Unicode generalCategoryLabelOf: 16r1F388. 'Symbol, Other'
> > >
> > > This revision resolves some slips in the category tags, adds an
> interface for retrieving/converting tags, unifies the vocabulary of the
> Unicode protocol, integrates the #initializeUnicodeData into the class
> initializer, and adds some tests.
> > >
> > > Still present limitations include:
> > > - Duplication between #parseUnicodeDataFrom: and
> #parseCompositionMappingFrom:
> > > - Redundant and scattered declaration of character categories
> > >
> > > For more information, see:
> http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.html
> > >
> > > =============== Postscript ===============
> > >
> > > "Postscript:
> > > Leave the line above, and replace the rest of this comment by a useful
> one.
> > > Executable statements should follow this comment, and should
> > > be separated by periods, with no exclamation points (!).
> > > Be sure to put any further comments in double-quotes, like this one."
> > >
> > >
> > > =============== Diff ===============
> > >
> > > Unicode class>>allCategoryTags {character classification} · ct 2/24/2022 19:41
> > > + allCategoryTags
> > > +
> > > +     ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)
> > >
> > > Unicode class>>generalCategoryIndexFromTag: {character classification} · ct 2/24/2022 19:33
> > > + generalCategoryIndexFromTag: tag
> > > +
> > > +     ^ (self allCategoryTags indexOf: tag) - 1
> > >
> > > Unicode class>>generalCategoryLabelForTag: {character classification} · ct 2/24/2022 21:47
> > > + generalCategoryLabelForTag: tag
> > > +
> > > +     ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1
> > >
> > > Unicode class>>generalCategoryTagOf: {character classification} · ct 2/24/2022 19:44
> > > + generalCategoryTagOf: aCharacterCode
> > > +
> > > +     ^ (self generalCategoryOf: aCharacterCode)
> > > +         ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
> > > +         ifNil: [#Cn]
> > >
> > > Unicode class>>initialize {class initialization} · ct 2/24/2022 21:43 (changed)
> > > initialize
> > >     " Unicode initialize "
> > >     self initializeTagConstants.
> > > -     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
> > > +     Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].
> > > +     GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].
> > >
> > > Unicode class>>initializeUnicodeData {unicode data} · ct 2/24/2022 19:03
> > > + initializeUnicodeData
> > > +     "self initializeUnicodeData"
> > > +
> > > +     self parseUnicodeDataFrom: self unicodeData readStream.
> > >
> > > Unicode class>>parseUnicodeDataFrom: {unicode data} · ct 2/24/2022 19:33 (changed)
> > > parseUnicodeDataFrom: stream
> > > - "
> > > -     self halt.
> > > -     self parseUnicodeDataFile
> > > - "
> > > +     "self initializeUnicodeData."
> > >
> > > -     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
> > > +     | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
> > >
> > >     toNumber := [:quad | ('16r', quad) asNumber].
> > >
> > >     GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
> > >     DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
> > >
> > >     16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
> > >     16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
> > >     16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
> > >
> > >     [(line := stream nextLine) size > 0] whileTrue: [
> > >         fieldEnd := line indexOf: $; startingAt: 1.
> > >         point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
> > >         point > 16rE007F ifTrue: [
> > >             GeneralCategory zapDefaultOnlyEntries.
> > >             DecimalProperty zapDefaultOnlyEntries.
> > >             ^ self].
> > >         2 to: 3 do: [:i |
> > >             fieldStart := fieldEnd + 1.
> > >             fieldEnd := line indexOf: $; startingAt: fieldStart.
> > >         ].
> > > -         generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
> > > +         tag := line copyFrom: fieldStart to: fieldEnd - 1.
> > > +         generalCategory := self generalCategoryIndexFromTag: tag.
> > >         GeneralCategory at: point+1 put: generalCategory.
> > > -         generalCategory = 'Nd' ifTrue: [
> > > +         generalCategory = Nd ifTrue: [
> > >             4 to: 7 do: [:i |
> > >                 fieldStart := fieldEnd + 1.
> > >                 fieldEnd := line indexOf: $; startingAt: fieldStart.
> > >             ].
> > >             decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
> > >             DecimalProperty at: point+1 put: decimalProperty asNumber.
> > >         ].
> > >     ].
> > >     GeneralCategory zapDefaultOnlyEntries.
> > >     DecimalProperty zapDefaultOnlyEntries.
> > >
> > >
> > > UnicodeTest
> > > + TestCase subclass: #UnicodeTest
> > > +     instanceVariableNames: ''
> > > +     classVariableNames: ''
> > > +     poolDictionaries: ''
> > > +     category: 'MultilingualTests-Encodings'
> > > +
> > > + UnicodeTest class
> > > +     instanceVariableNames: ''
> > > +
> > > + ""
> > >
> > > UnicodeTest class>>resources {accessing} · ct 2/24/2022 21:46
> > > + resources
> > > +
> > > +     ^ super resources copyWith: UnicodeTestResource
> > >
> > > UnicodeTest>>testGeneralCategoryLabel {tests - character classification} · ct 2/24/2022 21:49
> > > + testGeneralCategoryLabel
> > > +
> > > +     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelOf: $a asUnicode).
> > > +     self assert: 'Letter, Uppercase' equals: (Unicode generalCategoryLabelOf: $Z asUnicode).
> > > +
> > > +     self assert: 'Number, Decimal' equals: (Unicode generalCategoryLabelOf: $5 asUnicode).
> > > +     self assert: 'Symbol, Other' equals: (Unicode generalCategoryLabelOf: 16r1F388).
> > > +
> > > +     self assert: 'n/a' equals: (Unicode generalCategoryLabelOf: Float infinity).
> > >
> > > UnicodeTest>>testGeneralCategoryLabelForTag {tests - character classification} · ct 2/24/2022 21:48
> > > + testGeneralCategoryLabelForTag
> > > +
> > > +     self assert: 'Letter, Lowercase' equals: (Unicode generalCategoryLabelForTag: #Ll).
> > >
> > > UnicodeTest>>testGeneralCategoryTag {tests - character classification} · ct 2/24/2022 21:49
> > > + testGeneralCategoryTag
> > > +
> > > +     self assert: #Ll equals: (Unicode generalCategoryTagOf: $a asUnicode).
> > > +     self assert: #Lu equals: (Unicode generalCategoryTagOf: $Z asUnicode).
> > > +
> > > +     self assert: #Nd equals: (Unicode generalCategoryTagOf: $5 asUnicode).
> > > +     self assert: #So equals: (Unicode generalCategoryTagOf: 16r1F388).
> > > +
> > > +     self assert: #Cn equals: (Unicode generalCategoryTagOf: Float infinity).
> > >
> > > UnicodeTestResource
> > > + TestResource subclass: #UnicodeTestResource
> > > +     instanceVariableNames: ''
> > > +     classVariableNames: ''
> > > +     poolDictionaries: ''
> > > +     category: 'MultilingualTests-Encodings'
> > > +
> > > + UnicodeTestResource class
> > > +     instanceVariableNames: ''
> > > +
> > > + ""
> > >
> > > UnicodeTestResource>>setUp {running} · ct 2/24/2022 21:45
> > > + setUp
> > > +
> > > +     super setUp.
> > > +
> > > +     "Test the functionality of this update logic"
> > > +     Unicode initializeCompositionMappings.
> > > +     Unicode initializeUnicodeData.
> > >
> > > ---
> > > Sent from Squeak Inbox Talk [
> https://github.com/hpi-swa-lab/squeak-inbox-talk]
> > >
> > > On 2020-09-11T00:49:24+02:00, leves at caesar.elte.hu wrote:
> > >
> > > > Hi Christoph,
> > > >
> > > > On Wed, 9 Sep 2020, Thiede, Christoph wrote:
> > > >
> > > > >
> > > > > Hi Levente,
> > > > >
> > > > >
> > > > > basically, I would only like to get rid of the class variables for
> > > > > every single Unicode category because they provide low explorability
> > > > > (it's hard to work with the numeric output of #generalCategoryOf:) and
> > > > > low extensibility (you need to recompile the class definition for
> > > > > adding UTF-16 support). If you are critical of increasing the size of
> > > > > the SparseLargeTable, I think we could also just make one or two extra
> > > > > dictionaries to map every category symbol to a number and vice versa.
> > > > > What do you think?
> > > >
> > > > You mean an array to map the integers to symbols, right? :)
> > > > Anyway, I don't think it's worth using symbols internally.
> > > > For example, #isLetterCode: is 8-10% slower with the extra array lookup
> > > > and checking the category symbol's first letter than the current method
> > > > of integer comparisons.
> > > >
> > > > Do you expect these constants to appear outside the Unicode class? If yes,
> > > > then using symbols for those cases is probably a good solution.
> > > > But for internal use, the integers are better.
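
For illustration, the integer-to-symbol mapping could be a plain array
lookup at the API boundary, leaving the table contents as integers. The tag
order below is copied from #allCategoryTags in the 2022 changeset quoted
above; the temporaries exist only for this example:

```smalltalk
| tags index |
tags := #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs).
"Symbol to zero-based integer, the form that is stored internally."
index := (tags indexOf: #Ll) - 1.
"Integer back to symbol, only needed when a value leaves the Unicode class."
tags at: index + 1.
```

The array has 30 entries, one per category, so the integers keep fitting
into a byte.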
> > > >
> > > >
> > > > Levente
> > > >
> > > > >
> > > > >
> > > > > Best,
> > > > >
> > > > > Christoph
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org>
> > > > > on behalf of Levente Uzonyi <leves at caesar.elte.hu>
> > > > > Sent: Tuesday, 8 September 2020 21:43:56
> > > > > To: The general-purpose Squeak developers list
> > > > > Subject: Re: [squeak-dev] Unicode
> > > > > Hi Christoph,
> > > > >
> > > > > On Tue, 8 Sep 2020, Thiede, Christoph wrote:
> > > > >
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > >
> > > > > > > Your words suggest that it has already been published, but I
> > > > > > > can't find it anywhere.
> > > > > >
> > > > > >
> > > > > > Then I must have expressed myself wrong. I did not yet publish any
> > > > > > code changes, but in my original post from March, you can find a
> > > > > > short description of the design changes I'd like to implement.
> > > > > > Essentially, I would like to replace the separate class variables
> > > > > > for every known character class in favor of greater flexibility.
> > > > >
> > > > > How would your changes affect GeneralCategory? Would it still be a
> > > > > SparseLargeTable with ByteArray as arrayClass?
> > > > > If you just replace those integers with symbols, the size of the table
> > > > > will be at least 4 or 8 times larger in 32 or 64 bit images,
> > > > > respectively.
> > > > >
> > > > >
> > > > > Levente.
> > > > >
> > > > > >
> > > > > > Eliot, are there any remaining questions regarding the VM size?
> Character size should be sufficient as discussed below, and of course, I
> can test any changes in a 32-bit image, too. :-)
> > > > > >
> > > > > > WideString withAll: (#(42r2l 16r1F497 16r1F388 33) collect:
> #asCharacter)
> > > > > >
> > > > > > Best,
> > > > > > Christoph
> > > > > >
> > > > >
> > > > > > ________________________________
> > > > > > From: Squeak-dev <squeak-dev-bounces at lists.squeakfoundation.org>
> > > > > > on behalf of Tobias Pape <Das.Linux at gmx.de>
> > > > > > Sent: Sunday, 6 September 2020 21:00:14
> > > > > > To: The general-purpose Squeak developers list
> > > > > > Subject: Re: [squeak-dev] Unicode
> > > > > >
> > > > > > > On 06.09.2020, at 20:40, Levente Uzonyi <leves at
> caesar.elte.hu> wrote:
> > > > > > >
> > > > > > > On Sun, 6 Sep 2020, Tobias Pape wrote:
> > > > > > >
> > > > > > >>
> > > > > > >>> On 06.09.2020, at 19:15, Eliot Miranda <eliot.miranda at
> gmail.com> wrote:
> > > > > > >>> Hi Christoph, Hi All,
> > > > > > >>>> On Mar 17, 2020, at 3:51 PM, Thiede, Christoph
> <Christoph.Thiede at student.hpi.uni-potsdam.de> wrote:
> > > > > > >>>> Hi all! :-)
> > > > > > >>>> After some recent fun with the Unicode class, I found out
> that its data is quite out of date (for example, the comments do not even
> "know" that code points can be longer than 4 bytes. Younger characters such
> as ???
> > > > > > are not categorized correctly, etc. ...). Luckily, there is
> already some logic to fetch the latest data from www.unicode.org. I'm
> currently reworking this logic because it's not completely automated yet
> and has some slips,
> > > > > > but so long, I have one general question for you:
> > > > > > >>> And consequently I have a couple of questions for you. In the
> > > > > > >>> Spur VM Characters are immediate (they are like SmallInteger and
> > > > > > >>> exist in oops (object-oriented pointers) as tagged values). In
> > > > > > >>> the 32-bit variant Characters are 30-bit unsigned integers. In
> > > > > > >>> the 64-bit variant they are also 30-bit unsigned integers, but
> > > > > > >>> could easily be extended to be up to 61-bit unsigned integers.
> > > > > > >>> Q1, can you arrange that the Unicode support does not break in
> > > > > > >>> initialization on the 32-bit variant? It may be that the 32-bit
> > > > > > >>> variant cannot represent code points beyond 30 bits in size, but
> > > > > > >>> we should try to ensure that initialization still runs to
> > > > > > >>> completion even if it fails to initialize information relating
> > > > > > >>> to code points beyond 30 bits in size.
> > > > > > >>> Q2, how many bits should the 64-bit variant VM support for
> immediate Characters?
> > > > > > >>
> > > > > > > >> Unicode has a max value of 0x10FFFF. That makes 21 bits. So no
> worries there.
> > > > > > >>
> > > > > > >> We should just not forget the leading-char stuff (Yoshiki,
> Andreas,...)
> > > > > > >
> > > > > > > AFAIU the leading char only makes sense when you have multiple
> CJK(V?) languages in use at the same time. In other cases Unicode
> (leadingChar = 0) is perfectly fine.
> > > > > > > IIRC there are 22 bits available for the codePoint and 8 for
> the leadingChar, so we're still good: all unicode characters fit.
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > \o/ hooray!
> > > > > >
> > > > > > > Levente
> > > > > > >
> > > > > > >>
> > > > > > >>
> > > > > > > >> Best regards
> > > > > > > >>       -Tobias
> > > > > > >>
> > > > > > >>> Then something to consider is that it is conceptually
> possible to support something like WideCharacter, which would represent
> code points outside of the immediate Character range on the 32-bit variant,
> analogous to
> > > > > > > LargePositiveInteger beyond SmallInteger maxVal. This can be
> made to work seamlessly, just as it does currently with integers, and with
> Floats where SmallFloat64 is only used on 64-bits.
> > > > > > >>> It has implications in a few parts of the system:
> > > > > > >>> - failure code for WideString (VeryWideString?) at:[put:]
> primitives that would have to manage overflow into/access from
> WideCharacter instances
> > > > > > >>> - ImageSegment and other (un)pickling systems that need to
> convert to/from a bit-specific "wire" protocol/representation
> > > > > > > >>> - 32-bit <=> 64-bit image conversion
> > > > > > > >>> All this is easily doable (because we have models of doing it for Float and Integer general instances). But we need good specifications so we can implement the right thing from the get-go.
> > > > > > >>>> At the moment, we have 30 class variables each for one
> Unicode category number. These class vars map in alphabetical order to the
> integers from 0 to: 29. Is this tedious structure really necessary? For
> different
> > > > > > purposes, I would like to get the category name of a specific
> code point from a client. The current design makes this impossible without
> writing additional mappings.
> > > > > > >>>> Tl;dr: I would like to propose to drop these class
> variables and use Symbols instead. They are comparable like integers, and
> as they are flyweights, this should not be a performance issue either. Of
> course,
> > > > > > #generalCategoryOf: will have to keep returning numbers, but we
> could deprecate it and use a new #generalTagOf: in the future. Furthermore,
> this would also allow us to deal with later added category names (though I
> don't
> > > > > know
> > > > > > whether this will ever happen).
> > > > > > >>>> Examples:
> > > > > > >>>> Unicode generalTagOf: $a asUnicode. "#Ll"
> > > > > > >>>> Unicode class >> isLetterCode: charCode
> > > > > > > >>>>   ^ (self generalTagOf: charCode) first = $L
> > > > > > >>>> Unicode class >> isAlphaNumericCode: charCode
> > > > > > > >>>>   | tag |
> > > > > > > >>>>   ^ (tag := self generalTagOf: charCode) first = $L
> > > > > > > >>>>        or: [tag = #Nd]
> > > > > > > >>>> What do you think about this proposal? Please let me know
> and I will go ahead! :D
> > > > > > >>>> Best,
> > > > > > >>>> Christoph
> > > > > > >>> Best, Eliot
> > > > > > >>> _,,,^..^,,,_ (phone)
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > ["UnicodeData.2.cs"]
> > >
> > >
> > -------------- next part --------------
> > A non-text attachment was scrubbed...
> > Name: UnicodeData.3.cs
> > Type: application/octet-stream
> > Size: 10387 bytes
> > Desc: not available
> > URL: <
> http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20220228/2d10d9ef/attachment.obj
> >
> >
> >
> ["UnicodeData.png"]
>
>
>
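The bit budget discussed in the quoted thread (21 bits for the largest code point, 22 + 8 bits for code and leadingChar inside a 30-bit immediate Character) can be checked in a workspace. This is only a minimal sketch using plain Integer arithmetic; the field widths are taken from Levente's "IIRC" remark above, and the block names (pack, unpackLeadingChar, unpackCode) are made up for illustration and are not the actual Character implementation.

```smalltalk
"Workspace sketch: select an expression and 'print it'."
| maxCodePoint codePointBits leadingCharBits pack unpackLeadingChar unpackCode packed |
maxCodePoint := 16r10FFFF.        "largest Unicode code point"
codePointBits := 22.              "ASSUMPTION: width of the code field, as recalled in the thread"
leadingCharBits := 8.             "ASSUMPTION: width of the leadingChar field, as recalled in the thread"

maxCodePoint highBit.             "21 bits are enough for any code point"
codePointBits + leadingCharBits.  "30, the size of an immediate Character value"

"Hypothetical helpers: combine and split a (leadingChar, code) pair."
pack := [:leadingChar :code | (leadingChar bitShift: codePointBits) + code].
unpackLeadingChar := [:value | value bitShift: codePointBits negated].
unpackCode := [:value | value bitAnd: (1 bitShift: codePointBits) - 1].

"Round trip with the largest code point and a non-zero leadingChar."
packed := pack value: 5 value: maxCodePoint.
{ unpackLeadingChar value: packed.
  unpackCode value: packed.
  packed highBit <= 30 }
```

If the assumed widths are right, the last expression answers the original leadingChar, the original code point, and true, which is the "all unicode characters fit" observation from the thread.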
-------------- next part --------------
A non-text attachment was scrubbed...
Name: UnicodeData.png
Type: image/png
Size: 298829 bytes
Desc: not available
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20220405/3c86bcbe/attachment-0001.png>