[squeak-dev] The Trunk: Collections-mt.993.mcz

Thu Mar 10 14:27:33 UTC 2022

Marcel Taeumel uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-mt.993.mcz

==================== Summary ====================

Name: Collections-mt.993
Author: mt
Time: 10 March 2022, 3:27:30.144178 pm
UUID: 69da318e-87ed-0d4e-813d-cd16176618f0
Ancestors: Collections-eem.992

!!! New class comment for Character. Please read and/or review !!!

Locale clean-up. Complements Multilingual-mt.269 and System-mt.1318

=============== Diff against Collections-eem.992 ===============

Item was changed:
  ----- Method: ByteString>>applyLanguageInformation: (in category 'accessing') -----
+ applyLanguageInformation: aLanguage
+ 	"Overwritten because the receiver has latin-1 encoding and thus needs no extra language information applied."
- applyLanguageInformation: languageEnvironment
  !

Item was changed:
  ----- Method: ByteString>>convertFromSystemString (in category 'converting') -----
  convertFromSystemString

  	| readStream writeStream converter |
  	readStream := self readStream.
  	writeStream := String new writeStream.
+ 	converter := Locale currentPlatform systemConverter.
- 	converter := LanguageEnvironment defaultSystemConverter.
  	converter ifNil: [^ self].
  	[readStream atEnd] whileFalse: [
  		writeStream nextPut: (converter nextFromStream: readStream)].
  	^ writeStream contents
  !

Item was changed:
  Magnitude immediateSubclass: #Character
  	instanceVariableNames: ''
  	classVariableNames: 'AlphaNumericMask ClassificationTable DigitBit DigitValues LetterMask LowercaseBit UppercaseBit'
  	poolDictionaries: ''
  	category: 'Collections-Strings'!

+ !Character commentStamp: 'mt 3/10/2022 11:56' prior: 0!
+ I represent a character by storing its associated Unicode code point as an unsigned 30-bit value. Characters are created uniquely, so that all instances of a particular Unicode code point are identical.  My instances are encoded in tagged pointers in the VM, so-called immediates, and therefore are pure immutable values.
- !Character commentStamp: 'eem 8/12/2014 14:53' prior: 0!
- I represent a character by storing its associated Unicode as an unsigned 30-bit value.  Characters are created uniquely, so that all instances of a particular Unicode are identical.  My instances are encoded in tagged pointers in the VM, so called immediates, and therefore are pure immutable values.

+ Here is my bit layout (1-based, check via #asInteger and #bitAt:):
+ 	1..21	Unicode code point (i.e., 0 to: 16r1FFFFF, valid up to 16r10FFFF)
+ 	22		Reserved
+ 	23..30	User data (1 byte, see below about #leadingChar or "encoding tag")
+ 	31..32	VM-specific (i.e., tagged pointers, not accessible from image)
- 	The code point is based on Unicode.  Since Unicode is 21-bit wide character set, we have several bits available for other information.  As the Unicode Standard  states, a Unicode code point doesn't carry the language information.  This is going to be a problem with the languages so called CJK (Chinese, Japanese, Korean.  Or often CJKV including Vietnamese).  Since the characters of those languages are unified and given the same code point, it is impossible to display a bare Unicode code point in an inspector or such tools.  To utilize the extra available bits, we use them for identifying the languages.  Since the old implementation uses the bits to identify the character encoding, the bits are sometimes called "encoding tag" or neutrally "leading char", but the bits rigidly denotes the concept of languages.

+ The integer value of my instances you can observe in the image can thus range from 0 up to 16r3FFFFFFF. The two highest bits are not accessible. (For simplicity, we just assume those are at the higher end. The VM can also choose to use the two lowest bits, which does not change the in-image perspective.)
+ 
+ ***
+ 
+ I. About Character Encoding
+ 
+ In the early days of Squeak, the bits in each character had the single-byte MacRoman encoding. There was only ByteString, no WideString. The VM (primitives/plugins) provided user-input values in MacRoman, the fonts expected MacRoman for glyph mapping, and source code was stored (.changes, .sources, .st, .cs) in MacRoman. The most prominent non-ASCII character was probably the #annotationSeparator ($· MacRoman code point 225).
+ 
+ With the release of Squeak 3.8 in June 2005, support for multilingualization (m17n) was added. There were now various text converters (e.g., UTF8TextConverter) and an extensible mechanism to decode platform-specific encodings for Squeak. Since byte streams from files or sockets can have any encoding, the relevant ones concerned user/keyboard input, file paths/names, the platform clipboard, and source code. See the following (sub-)classes:
+ 
+ 	- TextConverter
+ 	- KeyboardInputInterpreter
+ 	- ClipboardInterpreter
+ 
+ Squeak's internal encoding changed from MacRoman to Latin-1 (i.e., ISO 8859-1), the helper #macToSqueak was introduced, and the #annotationSeparator now had the code point 183 (via SqueakV39.sources). Since the release of Squeak 3.9 in March 2008, both .sources and .changes files have been using the UTF8TextConverter. So, actually, Squeak's character encoding was now plain Unicode code points, which includes Latin-1 for code points 0 to 255 (and ASCII from 0 to 127).
+ 
+ Older VMs provided MacRoman encoding on all supported platforms for keyboard input, file paths, and clipboard contents. With the introduction of Unicode to Squeak, VMs were able to provide Unicode code points (e.g., #win32VMUsesUnicode). However, VMs may just have passed platform-specific encodings into the image, such as those of X11 or Windows on a Japanese platform. In that case, Squeak needs to be aware of how to decode that content to get plain Unicode code points. Read more about LanguageEnvironment below.
+ 
+ 
+ II. Unicode and Han Unification, the "Leading Char"
+ 
+ As the Unicode standard states, a Unicode code point does not carry the language information. This is a challenge for languages called CJK[V] (i.e., Chinese, Japanese, Korean, [Vietnamese]). Since the characters of those languages are unified and given the same code point, it is not possible to derive language-specific rules for text composition and display directly from characters.
+ 
+ With the release of Squeak 3.8 and its m17n support, some of the higher bits in each character were used to denote language information. Each time platform content was decoded (e.g., via UTF8TextConverter), each character was tagged with the currently known LanguageEnvironment via Unicode class >> #value: or Character class >> #leadingChar:code:. That tag is known as "encoding tag" or #leadingChar.
+ 
+ Now, the #leadingChar offers language information that can be used in text composition and display. For example, the (soft) line-break rules in Japanese are not based on whitespace but other characters. See JapaneseEnvironment class >> #isBreakableAt:in:. Also, the concept of "font sets" (see StrikeFontSet and TTCFontSet) uses the tag to select a specific (limited-range) font during text display. See TTCFontSet >> #widthOf: as an example.
+ 
+ The leading char 0 denotes a language without language-specific information. For historic reasons, this implies a rule set for text composition and display suitable for Western languages. Also, any text converter that processes Latin-1 (i.e., code point < 256) will not set its own leading character so that ByteString can still be used in such situations. For example, the Japanese leading char 5 and a Latin-1 character $a would otherwise result in "(5 << 22) + 97" and thus produce a WideString instead of a more compact ByteString.
+ 
+ See also:
+ 	Unicode class >> #value:
+ 	Character class >> #leadingChar:code:
+ 	Character >> #leadingChar
+ 	Character >> #charCode
+ 	
+ You can browse all senders of 16r3FFFFF to learn about how performance can be improved when working with characters that have a leading char 0.
+ 
+ 
+ III. Language Environment
+ 
+ At any given point in time, Squeak has a single, system-wide (i.e., global) language environment. Such a language environment drives the selection of platform-specific content converters as well as language-specific rules for text composition and display. Each environment has its own #leadingChar, which can be used to tag characters to then find your way back during character-specific operations such as the ones in CharacterScanner. It can also be used to find the correct glyph in a font set (i.e., StrikeFontSet or TTCFontSet).
+ 
+ In theory, the language environment could be the place where users are informed about missing fonts to display text in a certain language. There is basic support for that via #isFontAvailable and #installFont. Yet, those paths do not scan the current platform for available fonts but only a remote location on the Internet (i.e., #fontDownloadUrls).
+ 
+ Note that LanguageEnvironment is part of the "Multilingual" package. Applications should therefore use the Locale interface to only depend on the "System" package.
+ 
+ See:
+ 	Locale >> #leadingChar 
+ 	String >> #applyLanguageInformation:
+ 	Text >> #applyLanguageInformation:
+ 
+ Also note that "language translation" via a NaturalLanguageTranslator is a mechanism independent of language environments and content encoding. That is, for example, your (platform) locale may be "ja-JP" (Japanese/JAPAN) and your environment be JapaneseEnvironment, but your translation can still be into English (en-US) or German (de-DE) if you prefer that.
+ 
+ 
+ IV. Chunk Format, ]lang[ tag, and UTF8 .sources/.changes
+ 
+ Most source code can be expressed with the printable portion of ASCII, that is, code points from 32 to 126. Non-printable control characters can be accessed via selected class-side methods on Character such as "Character cr" and "Character tab" and even "Character space". Only a few methods go up to code point 255, and even fewer beyond that. All of this is fine since source code files are encoded in UTF8.
+ 
+ See
+ 	Unicode class >> #browseMethodsWithNonAsciiEncoding
+ 	Unicode class >> #browseMethodsWithLeadingCharEncoding
+ 
+ Now, it may happen that some methods store, for example, literal strings that have their #leadingChar set. This affects string comparison (#=), which cannot ignore the leading char for performance reasons. For example, JapaneseEnvironment class >> #isBreakableAt:in: configures the CompositionScanner only when a tagged WideString is composed. One cannot easily see this, but exploring the method's literals will reveal it.
+ 
+ UTF8 encoding can only handle Unicode code points and will thus discard the Squeak-specific leading char (or language information). For Squeak's chunk (i.e., source code) format, similar to the ]style[ tag for storing stand-off text attributes for a string, the language information is stored in a ]lang[ tag for all different ranges in the source string. Consequently, reading source code that has tagged characters will work.
+ 
+ BE AWARE THAT the system's current #leadingChar may differ from what is stored in source code (literals). That is, the UTF8TextConverter will apply the system's leading char via Unicode class >> #value: but that will then be overwritten when the ]lang[ tag is parsed from the chunk. You will not remove these tags by accident with simple insert/remove edits BUT involving the clipboard will often trigger UTF8 conversion and then reset all leading chars on paste.
+ 
+ 
+ V. Future Work (as of March 2022)
+ 
+ The use of #leadingChar entails somewhat high maintenance costs. For its currently known use cases -- namely text composition and font selection -- it is simply not worth the effort and not even necessary.
+ 	First, text composition can easily be configured through TextStyle and TextAttribute. While the defaults may reside in the system-wide LanguageEnvironment, TextStyle can be specific to an application (or text field), and TextAttribute (maybe a new TextLanguage) can specify language per range like HTML/CSS does it. A quick check whether a certain code point is affected (e.g., cp > 255 or "is in CJK Unified Ideographs" ...) might help with performance.
+ 	Second, font selection needs to be more sophisticated than per language. There are fallback fonts, symbol fonts, language-specific fonts. Any font stack must be modeled on the basis of Unicode code points (and Unicode blocks), not a Squeak-specific encoding tag or leading char.
+ 
+ There are several decoding-specific helpers in very generic places:
+ 	- Character >> #asUnicode
+ 	- Character >> #isTraditionalDomestic
+ 	- WideString >> #isUnicodeStringWithCJK
+ 	- WideString >> #includesUnifiedCharacter
+ 	- WideString >> #mutateJISX0208StringToUnicode
+ 	- WideSymbol >> #mutateJISX0208StringToUnicode
+ 
+ I think those should be moved to the few places they are actually needed, which is around TextConverter.
+ 
+ Note that both StrikeFontSet and TTCFontSet are considered "legacy" at this point. Their use is discouraged even though they have not been deprecated along with #leadingChar yet and should work as usual.
+ 
+ Also note that if we were to add a Utf8String (besides ByteString and WideString) in the future, it would no longer be possible to store user data such as the #leadingChar for the characters in such strings. This extra info can only be held in WideString.!
- 	The other languages can have the language tag if you like.  This will help to break the large default font (font set) into separately loadable chunk of fonts.  However, it is open to the each native speakers and writers to decide how to define the character equality, since the same Unicode code point may have different language tag thus simple #= comparison may return false.!

Item was changed:
  ----- Method: Character class>>leadingChar:code: (in category 'instance creation') -----
  leadingChar: leadChar code: code

+ 	code <= 16rFF ifTrue: [ ^ self value: code "ascii or latin-1" ].
+ 	code > 16r1FFFFF ifTrue: [ self error: 'code is out of range' ].
+ 	
+ 	leadChar = 0 ifTrue: [ ^ self value: code "no language info" ].
+ 	leadChar > 16rFF ifTrue: [ self error: 'lead is out of range' ].
+ 	
+ 	^ self value: (leadChar bitShift: 22) + code!
- 	code >= 16r400000 ifTrue: [
- 		self error: 'code is out of range'.
- 	].
- 	leadChar >= 256 ifTrue: [
- 		self error: 'lead is out of range'.
- 	].
- 	code < 256 ifTrue: [ ^self value: code ].
- 	^self value: (leadChar bitShift: 22) + code.!
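
For reviewers, the new behavior of #leadingChar:code: can be checked in a workspace roughly as follows. This is only a sketch: the leading char 5 is taken from the class comment's Japanese example, and the code point 16r3042 (HIRAGANA LETTER A) is an arbitrary choice for illustration.

```smalltalk
"Workspace sketch. Assumes leading char 5 (Japanese, per the class comment)
 and 16r3042 as an arbitrary code point above Latin-1."
| tagged plain untagged |
tagged := Character leadingChar: 5 code: 16r3042.
tagged asInteger = ((5 bitShift: 22) + 16r3042). "true, tag sits above bit 22"
tagged leadingChar. "5"
tagged charCode. "16r3042"

"Latin-1 codes are never tagged, so ByteString stays usable."
plain := Character leadingChar: 5 code: 97.
plain == $a. "true"

"Leading char 0 means no language info, so the value is the bare code point."
untagged := Character leadingChar: 0 code: 16r3042.
untagged asInteger = 16r3042. "true"
```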

Item was changed:
  ----- Method: Character>>asUnicode (in category 'converting') -----
  asUnicode
+ 	"Answer the unicode encoding of the receiver. Use this method only in a TextConverter or similar. Maybe we should move this out of Character because it is very specific to the point in time where external content is decoded into Unicode code points. See senders. The indirection via #leadingChar and #encodedCharSet is unnecessarily complicated."
- 	"Answer the unicode encoding of the receiver"

  	| integerValue |
  	(integerValue := self asInteger) <= 16r3FFFFF ifTrue: [ ^integerValue ].
+ 	^self encodedCharSet convertToUnicode: (integerValue bitAnd: 16r3FFFFF)
- 	^self encodedCharSet charsetClass convertToUnicode: (integerValue bitAnd: 16r3FFFFF)
  !

Item was changed:
  ----- Method: Character>>charCode (in category 'accessing') -----
  charCode
+ 	"Drop the #leadingChar. See #leadingChar:code:."

+ 	^ self asInteger bitAnd: 16r3FFFFF!
- 	^ (self asInteger bitAnd: 16r3FFFFF).
- !

Item was changed:
  ----- Method: Character>>codePoint (in category 'accessing') -----
  codePoint
+ 	"Return the encoding value of the receiver. Until we stop supporting #leadingChar, we must forward to #charCode not #asInteger to get actual Unicode code points."
- 	"Return the encoding value of the receiver."
- 	#Fundmntl.

+ 	^ self charCode!
- 	^self asInteger!

Item was changed:
  ----- Method: Character>>encodedCharSet (in category 'accessing') -----
  encodedCharSet

+ 	self asInteger <= 16r3FFFFF ifTrue: [ ^Unicode ]. "Shortcut"
- 	self asInteger < 16r400000 ifTrue: [ ^Unicode ]. "Shortcut"
  	^EncodedCharSet charsetAt: self leadingChar
  !

Item was changed:
  ----- Method: Character>>isoToSqueak (in category 'converting') -----
  isoToSqueak 
+ 
+ 	self flag: #deprecated.
  	^self "no longer needed"!

Item was changed:
  ----- Method: Character>>squeakToIso (in category 'converting') -----
  squeakToIso
+ 
+ 	self flag: #deprecated.
  	^self "no longer needed"!

Item was changed:
  ----- Method: String>>applyLanguageInformation: (in category 'accessing') -----
+ applyLanguageInformation: aLanguage
+ 	"Apply language-specific information to the receiver. Note that aLanguage can be anything that answers #leadingChar at the moment, which includes instances of Locale, LanguageEnvironment, and (prototypical) Character. Also note that we even have to apply a leading char 0 to replace any prior language information."
- applyLanguageInformation: languageEnvironment

  	| leadingChar |
+ 	leadingChar := aLanguage leadingChar.
+ 	self withIndexDo: [:each :idx | each asInteger > 16rFF "ascii or latin-1"
+ 		ifTrue: [self at: idx put: (Character leadingChar: leadingChar code: each charCode)]].!
- 	leadingChar := languageEnvironment leadingChar.
- 	self withIndexDo: [:each :idx |
- 		each asciiValue > 255
- 			ifTrue: [self at: idx put: (Character leadingChar: leadingChar code: each asUnicode)]]!
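
A possible way to try the new String >> #applyLanguageInformation: in a workspace. The sketch uses a prototypical Character as aLanguage, which the method comment lists as one option; the leading char 5 and the code point 16r3042 are assumptions for illustration only.

```smalltalk
"Workspace sketch. Any object answering #leadingChar should do as aLanguage;
 here a prototypical Character with the assumed leading char 5 is used."
| str proto |
proto := Character leadingChar: 5 code: 16r3042.
str := WideString with: $a with: (Character value: 16r3042).

str applyLanguageInformation: proto.
(str at: 1) leadingChar. "0, Latin-1 characters are left alone"
(str at: 2) leadingChar. "5"
(str at: 2) charCode. "16r3042, code point is unchanged"

"Applying leading char 0 replaces the prior language information."
str applyLanguageInformation: (Character value: 0).
(str at: 2) asInteger = 16r3042. "true"
```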

Item was changed:
  ----- Method: String>>convertToSystemString (in category 'converting') -----
  convertToSystemString
+ 	^self convertToWithConverter: Locale currentPlatform systemConverter!
- 	^self convertToWithConverter: LanguageEnvironment defaultSystemConverter!

Item was added:
+ ----- Method: Text>>applyLanguageInformation: (in category 'accessing') -----
+ applyLanguageInformation: aLanguage
+ 	"Apply language-specific information to the receiver. Note that aLanguage can be an instance of Locale or LanguageEnvironment here."
+ 	
+ 	self flag: #todo. "mt: Add and use a TextLanguage attribute to avoid having to modify the receiver's string contents. Then we could directly use locale or locale-id and avoid Squeak's custom leadingChar. Maybe some language info (or sane defaults) can be derived from Unicode code points and blocks."
+ 	self string applyLanguageInformation: aLanguage.!