[squeak-dev] The Trunk: Collections-nice.933.mcz

Levente Uzonyi leves at caesar.elte.hu
Sat Apr 10 20:13:13 UTC 2021


Hi Nicolas,

Nice analysis. I'm surprised I didn't notice either me making the 
copy-paste mistake in PositionableStream >> #basicUpTo: and Andreas 
removing the optimized version from MultiByteFileStream.
As with all basic* methods, PositionableStream >> #basicUpTo: 
should simply be

 	^self upTo: anObject

I restored the removed version of MultiByteFileStream >> #basicUpTo: in my 
image. It makes reading the sources file 3x faster using the following 
"benchmark":

| file |
[
 	| hash |
 	file := (SourceFiles at: 1) readOnlyCopy.
 	hash := 0.
 	{
 		[ [ file atEnd ] whileFalse: [
 			hash := (hash bitXor: file nextChunk hash) hashMultiply ] ] timeToRun.
 		hash } ]
 	ensure: [ file ifNotNil: [ file close ] ].

"#(1414 265702489) naive"
"#(469 265702489) optimized"

So, I suggest restoring MultiByteFileStream >> #basicUpTo:, and changing 
PositionableStream >> #basicUpTo: to the version suggested above.
Also, PositionableStream >> #upTo: could use #streamContents: instead of 
duplicating its behavior.


Levente

On Sat, 10 Apr 2021, Nicolas Cellier wrote:

> Hi all,
> currently, MultiByteFileStream nextChunk is incorrect for utf8 stream:
> it attempts to decode utf8 twice.
>
> This crafted example will raise an error:
>
>    (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
>    (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
>
> nextChunk sends basicUpTo: with the intention to get an un-converted
> string, then sends decodeString: to have fast (bulk) decoding.
>
> Unfortunately basicUpTo: sends next instead of basicNext... This makes
> the utf8 decoded twice, which can falsify the source code in some
> cases, or even fail with an Error like in the crafted example above.
>
> an accelerated version of basicUpTo: was provided by Levente in
> Multilingual-ul.85.mcz
> but was removed in Multilingual-ar.119.mcz, and I didn't understand
> the intention by reading the commit message...
> basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in
> Collections-eem.684.mcz with the double decoding problem.
>
>
> Le sam. 10 avr. 2021 à 21:17, <commits at source.squeak.org> a écrit :
>>
>> Nicolas Cellier uploaded a new version of Collections to project The Trunk:
>> http://source.squeak.org/trunk/Collections-nice.933.mcz
>>
>> ==================== Summary ====================
>>
>> Name: Collections-nice.933
>> Author: nice
>> Time: 10 April 2021, 9:16:54.312889 pm
>> UUID: c066bf52-b7a5-474a-9614-90bbc3212e07
>> Ancestors: Collections-ul.932
>>
>> Quick fix for double utf8->squeak conversion via nextChunk.
>>
>>     (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
>>     (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
>>
>> =============== Diff against Collections-ul.932 ===============
>>
>> Item was changed:
>>   ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') -----
>>   basicUpTo: anObject
>>         "Answer a subcollection from the current access position to the
>>         occurrence (if any, but not inclusive) of anObject in the receiver. If
>>         anObject is not in the collection, answer the entire rest of the receiver."
>>         | newStream element |
>>         newStream := WriteStream on: (self collectionSpecies new: 100).
>> +       [self atEnd or: [(element := self basicNext) = anObject]]
>> -       [self atEnd or: [(element := self next) = anObject]]
>>                 whileFalse: [newStream nextPut: element].
>>         ^newStream contents!
>>
>>


More information about the Squeak-dev mailing list