[squeak-dev] The Inbox: Multilingual-pre.238.mcz

Sat May 26 08:02:20 UTC 2018

Patrick Rein uploaded a new version of Multilingual to project The Inbox:
http://source.squeak.org/inbox/Multilingual-pre.238.mcz

==================== Summary ====================

Name: Multilingual-pre.238
Author: pre
Time: 26 May 2018, 10:01:56.809551 am
UUID: 3f43777d-957b-374e-8bc6-2254763879f5
Ancestors: Multilingual-nice.237

Fixes the UTF-16 bug caused by a missing initialization of the latin1 map.
Also adds validation for overlong sequences in UTF-8 and refactors some of the UTF-8 conversion code to make the bitmasks more obvious. The performance hit from the validation seems to be negligible, but further testing is required.
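
As a rough illustration of why the new range checks catch overlong sequences, the workspace sketch below applies the same bit arithmetic as decodeByteString: to the largest byte combination each check rejects. The variable names twoByteMax, threeByteMax and fourByteMax are made up for this example and are not part of the commit. In each case the decoded value would still fit into a shorter encoding, which is what makes the longer form overlong.

```smalltalk
"Workspace sketch: largest code point each rejected prefix could carry."
| continuationByteMask twoByteMax threeByteMax fourByteMax |
continuationByteMask := 2r00111111.

"Two bytes: lead byte 16rC1 (below 2r11000010) with the largest continuation byte."
twoByteMax := ((16rC1 bitAnd: 2r00011111) bitShift: 6)
	+ (16rBF bitAnd: continuationByteMask).

"Three bytes: lead byte 16rE0 with second byte 16r9F (below 2r10100000)."
threeByteMax := ((16rE0 bitAnd: 2r00001111) bitShift: 12)
	+ ((16r9F bitAnd: continuationByteMask) bitShift: 6)
	+ (16rBF bitAnd: continuationByteMask).

"Four bytes: lead byte 16rF0 with second byte 16r8F (below 2r10010000)."
fourByteMax := ((16rF0 bitAnd: 2r00000111) bitShift: 18)
	+ ((16r8F bitAnd: continuationByteMask) bitShift: 12)
	+ ((16rBF bitAnd: continuationByteMask) bitShift: 6)
	+ (16rBF bitAnd: continuationByteMask).

"Print it: #(127 2047 65535), i.e. 16r7F, 16r7FF and 16rFFFF."
{twoByteMax. threeByteMax. fourByteMax}
```

The results are exactly the upper bounds of the one-, two- and three-byte encodings, so anything below the thresholds used in the diff has a shorter valid form and is rejected as malformed.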

=============== Diff against Multilingual-nice.237 ===============

Item was changed:
  ----- Method: ByteTextConverter class>>initialize (in category 'class initialization') -----
  initialize

        self == ByteTextConverter 
  		ifTrue: [self allSubclassesDo: [:c | c initialize]]
+ 		ifFalse: [self
- 		ifFalse: [self 
  					initializeDecodeTable; 
  					initializeEncodeTable; 
  					initializeLatin1MapAndEncodings]
  !

Item was added:
+ ----- Method: UTF16TextConverter class>>initializeLatin1MapAndEncodings (in category 'utilities') -----
+ initializeLatin1MapAndEncodings
+ 	"Initialize the latin1Map and latin1Encodings.
+ 	These variables ensure that conversions from latin1 ByteStrings are reasonably fast"
+ 	
+ 	latin1Map := (ByteArray new: 256) atAllPut: 1.
+ 	latin1Encodings := (0 to: 255) collect: [:i | (ByteArray newFrom: {0 . i}) asString]!

Item was changed:
  ----- Method: UTF8TextConverter class>>decodeByteString: (in category 'conversion') -----
  decodeByteString: aByteString
  	"Convert the given string from UTF-8 using the fast path if converting to Latin-1"

+ 	| outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode continuationByteMask |
- 	| outStream lastIndex nextIndex limit byte1 byte2 byte3 byte4 unicode |
  	lastIndex := 1.
  	(nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0
  		ifTrue: [ ^aByteString ].
  	limit := aByteString size.
  	outStream := (String new: limit) writeStream.
+ 	continuationByteMask := 2r00111111.
  	[
  		outStream next: nextIndex - lastIndex putAll: aByteString startingAt: lastIndex.
  		byte1 := aByteString byteAt: nextIndex.
+ 		
+ 		"The byte range checks are separated into single checks to allow for implementing recovery --pre
+ 		For the rules see: http://www.unicode.org/versions/Unicode7.0.0/UnicodeStandard-7.0.pdf page 125 table 3-7"
+ 		(byte1 bitAnd: 2r11100000) = 2r11000000 ifTrue: [ "two bytes"
+ 			nextIndex < limit ifFalse: [ ^self errorMalformedInput: aByteString ].
- 		(byte1 bitAnd: 16rE0) = 192 ifTrue: [ "two bytes"
- 			nextIndex < limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
  			byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
+ 			(byte1 < 2r11000010) ifTrue: [ ^ self errorMalformedInput: aByteString ]. "other requirements are covered by initial bit mask"
+ 			(byte2 bitAnd: 16rC0) = 16r80 ifFalse: [^ self errorMalformedInput: aByteString].  
+ 			unicode := ((byte1 bitAnd: 2r00011111) bitShift: 6) + (byte2 bitAnd: continuationByteMask)].
+ 		
+ 		(byte1 bitAnd: 2r11110000) = 2r11100000 ifTrue: [ "three bytes"
- 			(byte2 bitAnd: 16rC0) = 16r80 ifFalse:[	^self errorMalformedInput: aByteString ].
- 			unicode := ((byte1 bitAnd: 31) bitShift: 6) + (byte2 bitAnd: 63)].
- 		(byte1 bitAnd: 16rF0) = 224 ifTrue: [ "three bytes"
  			(nextIndex + 2) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
  			byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
  			(byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ 			((byte1 bitAnd: 2r00001111) = 2r0 and: [byte2 < 2r10100000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
+ 			((byte1 = 2r11101101) and: [(byte2 bitAnd: 2r00100000) = 2r00100000]) ifTrue: [ 
+ 				"reserved codepoints"
+ 				^self errorMalformedInput: aByteString ].
  			byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
  			(byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ 			unicode := ((byte1 bitAnd: 2r00001111) bitShift: 12) + ((byte2 bitAnd: continuationByteMask) bitShift: 6)
+ 				+ (byte3 bitAnd: continuationByteMask)].
+ 			
+ 		(byte1 bitAnd: 2r11111000) = 2r11110000 ifTrue: [ "four bytes"
- 			unicode := ((byte1 bitAnd: 15) bitShift: 12) + ((byte2 bitAnd: 63) bitShift: 6)
- 				+ (byte3 bitAnd: 63)].
- 		(byte1 bitAnd: 16rF8) = 240 ifTrue: [ "four bytes"
  			(nextIndex + 3) <= limit ifFalse: [ ^ self errorMalformedInput: aByteString ].
  			byte2 := aByteString byteAt: (nextIndex := nextIndex + 1).
  			(byte2 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ 			((byte1 = 2r11110000) and: [byte2 < 2r10010000]) ifTrue: [ ^self errorMalformedInput: aByteString ].
  			byte3 := aByteString byteAt: (nextIndex := nextIndex + 1).
  			(byte3 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
  			byte4 := aByteString byteAt: (nextIndex := nextIndex + 1).
  			(byte4 bitAnd: 16rC0) = 16r80 ifFalse:[ ^self errorMalformedInput: aByteString ].
+ 			unicode := ((byte1 bitAnd: 2r00000111) bitShift: 18) +
+ 							((byte2 bitAnd: continuationByteMask) bitShift: 12) + 
+ 							((byte3 bitAnd: continuationByteMask) bitShift: 6) +
+ 							(byte4 bitAnd: continuationByteMask)].
+ 						
- 			unicode := ((byte1 bitAnd: 16r7) bitShift: 18) +
- 							((byte2 bitAnd: 63) bitShift: 12) + 
- 							((byte3 bitAnd: 63) bitShift: 6) +
- 							(byte4 bitAnd: 63)].
  		unicode ifNil: [ ^self errorMalformedInput: aByteString ].
  		unicode = 16rFEFF ifFalse: [ "Skip byte order mark"
  			outStream nextPut: (Unicode value: unicode) ].
  		lastIndex := nextIndex + 1.
  		(nextIndex := ByteString findFirstInString: aByteString inSet: latin1Map startingAt: lastIndex) = 0 ] whileFalse.
  	^outStream 
  		next: aByteString size - lastIndex + 1 putAll: aByteString startingAt: lastIndex;
  		contents
  !