[squeak-dev] The Trunk: Multilingual-ct.271.mcz
commits at source.squeak.org
Mon Apr 4 17:57:38 UTC 2022
Christoph Thiede uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-ct.271.mcz
==================== Summary ====================
Name: Multilingual-ct.271
Author: ct
Time: 4 April 2022, 7:57:21.50746 pm
UUID: de94b8ca-494e-d149-b2e0-e7e6d714d25b
Ancestors: Multilingual-mt.270
Merges UnicodeData.cs:
This changeset repairs the fetching and parsing of Unicode category data, adds the new interface #generalCategoryTagOf:, and adds a protocol for converting between Unicode categories and tags. Usage:
Unicode reinitializeData.
Unicode generalCategoryLabelOf: 16r1F388. "'Symbol, Other'"
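The tag protocol added in this version can be tried in a workspace in the same way. This is only a sketch: it assumes that "Unicode reinitializeData" has already been run and that #generalCategoryLabels (referenced in the diff but not shown there) answers one label per entry of #allCategoryTags. The temporary variable names are made up for illustration.

```smalltalk
"Sketch only: assumes Unicode reinitializeData has been run beforehand."
| code tag index label |
code := 16r1F388. "U+1F388 BALLOON, the code point from the usage example"
tag := Unicode generalCategoryTagOf: code. "a Symbol taken from #allCategoryTags"
index := Unicode generalCategoryIndexFromTag: tag. "zero-based position of the tag in #allCategoryTags"
label := Unicode generalCategoryLabelForTag: tag. "assumed to be the matching entry of #generalCategoryLabels"
Transcript showln: tag printString, ' ', index printString, ' ', label asString.
```

Note that #generalCategoryIndexFromTag: answers -1 for a tag that is not in #allCategoryTags, because #indexOf: answers 0 in that case, so #generalCategoryLabelForTag: should only be sent with a known tag.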
Remaining limitations include:
- Duplication between #parseUnicodeDataFrom: and #parseCompositionMappingFrom:
- Redundant and scattered declaration of character categories
Revision from 3.cs:
Minor clean-up in #parseUnicodeDataFrom:, fix default category values.
Note that this change will not yet upgrade your Unicode database; that will only happen when a new image is built. However, you can run "Unicode reinitializeData" to benefit from the new data right now. The only reason I did not put this into the postscript is to avoid any trouble with proxies or firewalls. :-)
Thanks to Levente (ul) and Marcel (mt) for their help! For more information, see: http://lists.squeakfoundation.org/pipermail/squeak-dev/2020-March/208020.html
=============== Diff against Multilingual-mt.270 ===============
Item was changed:
EUCCNTextConverter subclass: #CNGBTextConverter
instanceVariableNames: ''
classVariableNames: ''
poolDictionaries: ''
category: 'Multilingual-TextConversion'!
+
+ !CNGBTextConverter commentStamp: '<historical>' prior: 0!
+ Text converter for Simplified Chinese variation of EUC. (Even though the name doesn't look so, it is what it is.)!
Item was added:
+ ----- Method: Unicode class>>allCategoryTags (in category 'character classification') -----
+ allCategoryTags
+
+ ^ #(Cn Cc Cf Co Cs Ll Lm Lo Lt Lu Mc Me Mn Nd Nl No Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs)!
Item was changed:
----- Method: Unicode class>>blocks320Comment (in category 'comments') -----
blocks320Comment
+ "http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt"
"# Blocks-3.2.0.txt
# Correlated with Unicode 3.2
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplementary
0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0780..07BF; Thaana
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0C80..0CFF; Kannada
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1200..137F; Ethiopic
13A0..13FF; Cherokee
1400..167F; Unified Canadian Aboriginal Syllabics
1680..169F; Ogham
16A0..16FF; Runic
1700..171F; Tagalog
1720..173F; Hanunoo
1740..175F; Buhid
1760..177F; Tagbanwa
1780..17FF; Khmer
1800..18AF; Mongolian
1E00..1EFF; Latin Extended Additional
1F00..1FFF; Greek Extended
2000..206F; General Punctuation
2070..209F; Superscripts and Subscripts
20A0..20CF; Currency Symbols
20D0..20FF; Combining Diacritical Marks for Symbols
2100..214F; Letterlike Symbols
2150..218F; Number Forms
2190..21FF; Arrows
2200..22FF; Mathematical Operators
2300..23FF; Miscellaneous Technical
2400..243F; Control Pictures
2440..245F; Optical Character Recognition
2460..24FF; Enclosed Alphanumerics
2500..257F; Box Drawing
2580..259F; Block Elements
25A0..25FF; Geometric Shapes
2600..26FF; Miscellaneous Symbols
2700..27BF; Dingbats
27C0..27EF; Miscellaneous Mathematical Symbols-A
27F0..27FF; Supplemental Arrows-A
2800..28FF; Braille Patterns
2900..297F; Supplemental Arrows-B
2980..29FF; Miscellaneous Mathematical Symbols-B
2A00..2AFF; Supplemental Mathematical Operators
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
D800..DB7F; High Surrogates
DB80..DBFF; High Private Use Surrogates
DC00..DFFF; Low Surrogates
E000..F8FF; Private Use Area
F900..FAFF; CJK Compatibility Ideographs
FB00..FB4F; Alphabetic Presentation Forms
FB50..FDFF; Arabic Presentation Forms-A
FE00..FE0F; Variation Selectors
FE20..FE2F; Combining Half Marks
FE30..FE4F; CJK Compatibility Forms
FE50..FE6F; Small Form Variants
FE70..FEFF; Arabic Presentation Forms-B
FF00..FFEF; Halfwidth and Fullwidth Forms
FFF0..FFFF; Specials
10300..1032F; Old Italic
10330..1034F; Gothic
10400..1044F; Deseret
1D000..1D0FF; Byzantine Musical Symbols
1D100..1D1FF; Musical Symbols
1D400..1D7FF; Mathematical Alphanumeric Symbols
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement
E0000..E007F; Tags
F0000..FFFFF; Supplementary Private Use Area-A
100000..10FFFF; Supplementary Private Use Area-B
"!
Item was added:
+ ----- Method: Unicode class>>generalCategoryIndexFromTag: (in category 'character classification') -----
+ generalCategoryIndexFromTag: tag
+
+ ^ (self allCategoryTags indexOf: tag) - 1!
Item was added:
+ ----- Method: Unicode class>>generalCategoryLabelForTag: (in category 'character classification') -----
+ generalCategoryLabelForTag: tag
+
+ ^ self generalCategoryLabels at: (self generalCategoryIndexFromTag: tag) + 1!
Item was added:
+ ----- Method: Unicode class>>generalCategoryTagOf: (in category 'character classification') -----
+ generalCategoryTagOf: aCharacterCode
+
+ ^ (self generalCategoryOf: aCharacterCode)
+ ifNotNil: [:code | self allCategoryTags at: code + 1 ifAbsent: [#Cn]]
+ ifNil: [#Cn]!
Item was changed:
----- Method: Unicode class>>initialize (in category 'class initialization') -----
initialize
" Unicode initialize "
self initializeTagConstants.
+
+ self flag: #deduplicate. "Currently, we are downloading and parsing #unicodeData twice."
+ Compositions isEmptyOrNil ifTrue: [self initializeCompositionMappings].
+ GeneralCategory isEmptyOrNil ifTrue: [self initializeUnicodeData].!
- Compositions isEmptyOrNil ifTrue:[self initializeCompositionMappings].!
Item was added:
+ ----- Method: Unicode class>>initializeUnicodeData (in category 'unicode data') -----
+ initializeUnicodeData
+ "self initializeUnicodeData"
+
+ self parseUnicodeDataFrom: self unicodeData readStream.!
Item was changed:
----- Method: Unicode class>>parseUnicodeDataFrom: (in category 'unicode data') -----
parseUnicodeDataFrom: stream
+ "self initializeUnicodeData."
- "
- self halt.
- self parseUnicodeDataFile
- "
+ | line fieldEnd point fieldStart toNumber generalCategory decimalProperty tag |
- | line fieldEnd point fieldStart toNumber generalCategory decimalProperty |
toNumber := [:quad | ('16r', quad) asNumber].
+ GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: Cn.
- GeneralCategory := SparseLargeTable new: 16rE0080 chunkSize: 1024 arrayClass: Array base: 1 defaultValue: 'Cn'.
DecimalProperty := SparseLargeTable new: 16rE0080 chunkSize: 32 arrayClass: Array base: 1 defaultValue: -1.
+ GeneralCategory atAll: (16r3400 to: 16r4DB5) +1 put: Lo.
+ GeneralCategory atAll: (16r4E00 to: 16r9FA5) + 1 put: Lo.
+ GeneralCategory atAll: (16rAC00 to: 16rD7FF) + 1 put: Lo.
- 16r3400 to: 16r4DB5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
- 16r4E00 to: 16r9FA5 do: [:i | GeneralCategory at: i+1 put: 'Lo'].
- 16rAC00 to: 16rD7FF do: [:i | GeneralCategory at: i+1 put: 'Lo'].
[(line := stream nextLine) size > 0] whileTrue: [
fieldEnd := line indexOf: $; startingAt: 1.
point := toNumber value: (line copyFrom: 1 to: fieldEnd - 1).
point > 16rE007F ifTrue: [
GeneralCategory zapDefaultOnlyEntries.
DecimalProperty zapDefaultOnlyEntries.
^ self].
2 to: 3 do: [:i |
fieldStart := fieldEnd + 1.
fieldEnd := line indexOf: $; startingAt: fieldStart.
].
+ tag := line copyFrom: fieldStart to: fieldEnd - 1.
+ generalCategory := self generalCategoryIndexFromTag: tag.
- generalCategory := line copyFrom: fieldStart to: fieldEnd - 1.
GeneralCategory at: point+1 put: generalCategory.
+ generalCategory = Nd ifTrue: [
- generalCategory = 'Nd' ifTrue: [
4 to: 7 do: [:i |
fieldStart := fieldEnd + 1.
fieldEnd := line indexOf: $; startingAt: fieldStart.
].
decimalProperty := line copyFrom: fieldStart to: fieldEnd - 1.
DecimalProperty at: point+1 put: decimalProperty asNumber.
].
].
GeneralCategory zapDefaultOnlyEntries.
+ DecimalProperty zapDefaultOnlyEntries.!
- DecimalProperty zapDefaultOnlyEntries.
- !
Item was added:
+ ----- Method: Unicode class>>reinitializeData (in category 'class initialization') -----
+ reinitializeData
+
+ Compositions := GeneralCategory := nil.
+ self initialize.!