[squeak-dev] The Trunk: Multilingual-ct.271.mcz

commits at source.squeak.org
Mon Apr 4 17:57:38 UTC 2022

Christoph Thiede uploaded a new version of Multilingual to project The Trunk:

==================== Summary ====================

Name: Multilingual-ct.271
Author: ct
Time: 4 April 2022, 7:57:21.50746 pm
UUID: de94b8ca-494e-d149-b2e0-e7e6d714d25b
Ancestors: Multilingual-mt.270

Merges UnicodeData.cs:
	This changeset repairs the fetching and parsing of Unicode category data, adds the new interface #generalCategoryTagOf:, and adds protocol for converting between Unicode categories and tags. Usage:
		Unicode reinitializeData.
		Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"
	Remaining limitations:
	- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
	- Redundant and scattered declaration of character categories
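	A rough workspace sketch of the new conversion protocol, using only selectors added in this commit. The results in the comments are expectations, not verified output: they assume that "Unicode reinitializeData" has been run and that #generalCategoryLabels (not shown in this diff) lists its labels in the same order as #allCategoryTags:
		Unicode generalCategoryTagOf: 16r1F388. "expected: #So"
		Unicode generalCategoryIndexFromTag: #So. "26, the zero-based position of #So in #allCategoryTags"
		Unicode generalCategoryLabelForTag: #So. "expected: 'Symbol, Other'"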

Revision from 3.cs:
	Minor clean-up in #parseUnicodeDataFrom:, fix default category values.

Note that this change does not upgrade the Unicode database of an existing image; that only happens when a new image is built. To benefit from the new data right away, run "Unicode reinitializeData". I left this out of the postscript only to avoid trouble with proxies or firewalls, since reinitializing downloads the data. :-)

Thanks to Levente (ul) and Marcel (mt) for their help! For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.html

=============== Diff against Multilingual-mt.270 ===============

Item was changed:
  EUCCNTextConverter subclass: #CNGBTextConverter
  	instanceVariableNames: ''
  	classVariableNames: ''
  	poolDictionaries: ''
  	category: 'Multilingual-TextConversion'!
+ !CNGBTextConverter commentStamp: '<historical>' prior: 0!
+ Text converter for Simplified Chinese variation of EUC.  (Even though the name doesn't look so, it is what it is.)!

Item was added:
+ ----- Method: Unicode class>>allCategoryTags (in category 'character classification') -----
+ allCategoryTags
+ 	^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)!

Item was changed:
  ----- Method: Unicode class>>blocks320Comment (in category 'comments') -----
+ "http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt"
  "# Blocks-3.2.0.txt
  # Correlated with Unicode 3.2
  # Start Code..End Code; Block Name
  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  0100..017F; Latin Extended-A
  0180..024F; Latin Extended-B
  0250..02AF; IPA Extensions
  02B0..02FF; Spacing Modifier Letters
  0300..036F; Combining Diacritical Marks
  0370..03FF; Greek and Coptic
  0400..04FF; Cyrillic
  0500..052F; Cyrillic Supplementary
  0530..058F; Armenian
  0590..05FF; Hebrew
  0600..06FF; Arabic
  0700..074F; Syriac
  0780..07BF; Thaana
  0900..097F; Devanagari
  0980..09FF; Bengali
  0A00..0A7F; Gurmukhi
  0A80..0AFF; Gujarati
  0B00..0B7F; Oriya
  0B80..0BFF; Tamil
  0C00..0C7F; Telugu
  0C80..0CFF; Kannada
  0D00..0D7F; Malayalam
  0D80..0DFF; Sinhala
  0E00..0E7F; Thai
  0E80..0EFF; Lao
  0F00..0FFF; Tibetan
  1000..109F; Myanmar
  10A0..10FF; Georgian
  1100..11FF; Hangul Jamo
  1200..137F; Ethiopic
  13A0..13FF; Cherokee
  1400..167F; Unified Canadian Aboriginal Syllabics
  1680..169F; Ogham
  16A0..16FF; Runic
  1700..171F; Tagalog
  1720..173F; Hanunoo
  1740..175F; Buhid
  1760..177F; Tagbanwa
  1780..17FF; Khmer
  1800..18AF; Mongolian
  1E00..1EFF; Latin Extended Additional
  1F00..1FFF; Greek Extended
  2000..206F; General Punctuation
  2070..209F; Superscripts and Subscripts
  20A0..20CF; Currency Symbols
  20D0..20FF; Combining Diacritical Marks for Symbols
  2100..214F; Letterlike Symbols
  2150..218F; Number Forms
  2190..21FF; Arrows
  2200..22FF; Mathematical Operators
  2300..23FF; Miscellaneous Technical
  2400..243F; Control Pictures
  2440..245F; Optical Character Recognition
  2460..24FF; Enclosed Alphanumerics
  2500..257F; Box Drawing
  2580..259F; Block Elements
  25A0..25FF; Geometric Shapes
  2600..26FF; Miscellaneous Symbols
  2700..27BF; Dingbats
  27C0..27EF; Miscellaneous Mathematical Symbols-A
  27F0..27FF; Supplemental Arrows-A
  2800..28FF; Braille Patterns
  2900..297F; Supplemental Arrows-B
  2980..29FF; Miscellaneous Mathematical Symbols-B
  2A00..2AFF; Supplemental Mathematical Operators
  2E80..2EFF; CJK Radicals Supplement
  2F00..2FDF; Kangxi Radicals
  2FF0..2FFF; Ideographic Description Characters
  3000..303F; CJK Symbols and Punctuation
  3040..309F; Hiragana
  30A0..30FF; Katakana
  3100..312F; Bopomofo
  3130..318F; Hangul Compatibility Jamo
  3190..319F; Kanbun
  31A0..31BF; Bopomofo Extended
  31F0..31FF; Katakana Phonetic Extensions
  3200..32FF; Enclosed CJK Letters and Months
  3300..33FF; CJK Compatibility
  3400..4DBF; CJK Unified Ideographs Extension A
  4E00..9FFF; CJK Unified Ideographs
  A000..A48F; Yi Syllables
  A490..A4CF; Yi Radicals
  AC00..D7AF; Hangul Syllables
  D800..DB7F; High Surrogates
  DB80..DBFF; High Private Use Surrogates
  DC00..DFFF; Low Surrogates
  E000..F8FF; Private Use Area
  F900..FAFF; CJK Compatibility Ideographs
  FB00..FB4F; Alphabetic Presentation Forms
  FB50..FDFF; Arabic Presentation Forms-A
  FE00..FE0F; Variation Selectors
  FE20..FE2F; Combining Half Marks
  FE30..FE4F; CJK Compatibility Forms
  FE50..FE6F; Small Form Variants
  FE70..FEFF; Arabic Presentation Forms-B
  FF00..FFEF; Halfwidth and Fullwidth Forms
  FFF0..FFFF; Specials
  10300..1032F; Old Italic
  10330..1034F; Gothic
  10400..1044F; Deseret
  1D000..1D0FF; Byzantine Musical Symbols
  1D100..1D1FF; Musical Symbols
  1D400..1D7FF; Mathematical Alphanumeric Symbols
  20000..2A6DF; CJK Unified Ideographs Extension B
  2F800..2FA1F; CJK Compatibility Ideographs Supplement
  E0000..E007F; Tags
  F0000..FFFFF; Supplementary Private Use Area-A
  100000..10FFFF; Supplementary Private Use Area-B

Item was added:
+ ----- Method: Unicode class>>generalCategoryIndexFromTag: (in category 'character classification') -----
+ generalCategoryIndexFromTag: tag
+ 	^ (self allCategoryTags indexOf: tag) - 1!

Item was added:
+ ----- Method: Unicode class>>generalCategoryLabelForTag: (in category 'character classification') -----
+ generalCategoryLabelForTag: tag
+ 	^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1!

Item was added:
+ ----- Method: Unicode class>>generalCategoryTagOf: (in category 'character classification') -----
+ generalCategoryTagOf: aCharacterCode
+ 	^ (self generalCategoryOf: aCharacterCode)
+ 		ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+ 		ifNil: [#Cn]!

Item was changed:
  ----- Method: Unicode class>>initialize (in category 'class initialization') -----
  initialize
  	" Unicode initialize "
  	self initializeTagConstants.
+ 	self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+ 	Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+ 	GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].!
- 	Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].!

Item was added:
+ ----- Method: Unicode class>>initializeUnicodeData (in category 'unicode data') -----
+ initializeUnicodeData
+ 	"self initializeUnicodeData"
+ 	self parseUnicodeDataFrom: self unicodeData readStream.!

Item was changed:
  ----- Method: Unicode class>>parseUnicodeDataFrom: (in category 'unicode data') -----
  parseUnicodeDataFrom: stream
+ 	"self initializeUnicodeData."
- "
- 	self halt.
- 	self parseUnicodeDataFile
- "
+ 	| line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
- 	| line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
  	toNumber := [:quad | ('16r', quad) asNumber].
+ 	GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: Cn.
- 	GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue:  'Cn'.
  	DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
+ 	GeneralCategory atAll: (16r3400 to: 16r4DB5) +1 put: Lo.
+ 	GeneralCategory atAll: (16r4E00 to: 16r9FA5) + 1 put: Lo.
+ 	GeneralCategory atAll: (16rAC00 to: 16rD7FF) + 1 put: Lo.
- 	16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
- 	16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
- 	16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
  	[(line := stream nextLine) size > 0] whileTrue: [
  		fieldEnd := line indexOf: $; startingAt: 1.
  		point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
  		point > 16rE007F ifTrue: [
  			GeneralCategory zapDefaultOnlyEntries.
  			DecimalProperty zapDefaultOnlyEntries.
  			^ self].
  		2 to: 3 do: [:i |
  			fieldStart := fieldEnd + 1.
  			fieldEnd := line indexOf: $; startingAt: fieldStart.
  		].
+ 		tag := line copyFrom: fieldStart to: fieldEnd - 1.
+ 		generalCategory := self generalCategoryIndexFromTag: tag.
- 		generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
  		GeneralCategory at: point+1 put: generalCategory.
+ 		generalCategory = Nd ifTrue: [
- 		generalCategory = 'Nd' ifTrue: [
  			4 to: 7 do: [:i |
  				fieldStart := fieldEnd + 1.
  				fieldEnd := line indexOf: $; startingAt: fieldStart.
  			].
  			decimalProperty :=  line copyFrom: fieldStart to: fieldEnd - 1.
  			DecimalProperty at: point+1 put: decimalProperty asNumber.
  		].
  	].
  	GeneralCategory zapDefaultOnlyEntries.
+ 	DecimalProperty zapDefaultOnlyEntries.!
- 	DecimalProperty zapDefaultOnlyEntries.
- !

Item was added:
+ ----- Method: Unicode class>>reinitializeData (in category 'class initialization') -----
+ reinitializeData
+ 	Compositions := GeneralCategory := nil.
+ 	self initialize.!
