[squeak-dev] The Trunk: Collections-nice.933.mcz
Levente Uzonyi
leves at caesar.elte.hu
Sat Apr 10 20:13:13 UTC 2021
Hi Nicolas,
Nice analysis. I'm surprised I didn't notice either me making the
copy-paste mistake in PositionableStream >> #basicUpTo: and Andreas
removing the optimized version from MultiByteFileStream.
As with all basic* methods, PositionableStream >> #basicUpTo:
should simply be
^self upTo: anObject
I restored the removed version of MultiByteFileStream >> #basicUpTo: in my
image. It makes reading the sources file 3x faster using the following
"benchmark":
| file |
[
| hash |
file := (SourceFiles at: 1) readOnlyCopy.
hash := 0.
{
[ [ file atEnd ] whileFalse: [
hash := (hash bitXor: file nextChunk hash) hashMultiply ] ] timeToRun.
hash } ]
ensure: [ file ifNotNil: [ file close ] ].
"#(1414 265702489) naive"
"#(469 265702489) optimized"
So, I suggest restoring MultiByteFileStream >> #basicUpTo:, and changing
PositionableStream >> #basicUpTo: to the version suggested above.
Also, PositionableStream >> #upTo: could use #streamContents: instead of
duplicating its behavior.
Levente
On Sat, 10 Apr 2021, Nicolas Cellier wrote:
> Hi all,
> currently, MultiByteFileStream nextChunk is incorrect for utf8 stream:
> it attempts to decode utf8 twice.
>
> This crafted example will raise an error:
>
> (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
> (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
>
> nextChunk sends basicUpTo: with the intention to get an un-converted
> string, then sends decodeString: to have fast (bulk) decoding.
>
> Unfortunately basicUpTo: sends next instead of basicNext... This makes
> the utf8 decoded twice, which can falsify the source code in some
> cases, or even fail with an Error like in the crafted example above.
>
> an accelerated version of basicUpTo: was provided by Levente in
> Multilingual-ul.85.mcz
> but was removed in Multilingual-ar.119.mcz, and I didn't understand
> the intention by reading the commit message...
> basicUpTo: was then broken in Collections-ul.438.mcz, and fixed in
> Collections-eem.684.mcz with the double decoding problem.
>
>
> Le sam. 10 avr. 2021 à 21:17, <commits at source.squeak.org> a écrit :
>>
>> Nicolas Cellier uploaded a new version of Collections to project The Trunk:
>> http://source.squeak.org/trunk/Collections-nice.933.mcz
>>
>> ==================== Summary ====================
>>
>> Name: Collections-nice.933
>> Author: nice
>> Time: 10 April 2021, 9:16:54.312889 pm
>> UUID: c066bf52-b7a5-474a-9614-90bbc3212e07
>> Ancestors: Collections-ul.932
>>
>> Quick fix for double utf8->squeak conversion via nextChunk.
>>
>> (MultiByteFileStream newFileNamed: 'foo.utf8') nextPutAll: 'À'; close.
>> (MultiByteFileStream oldFileNamed: 'foo.utf8') nextChunk.
>>
>> =============== Diff against Collections-ul.932 ===============
>>
>> Item was changed:
>> ----- Method: PositionableStream>>basicUpTo: (in category 'private basic') -----
>> basicUpTo: anObject
>> "Answer a subcollection from the current access position to the
>> occurrence (if any, but not inclusive) of anObject in the receiver. If
>> anObject is not in the collection, answer the entire rest of the receiver."
>> | newStream element |
>> newStream := WriteStream on: (self collectionSpecies new: 100).
>> + [self atEnd or: [(element := self basicNext) = anObject]]
>> - [self atEnd or: [(element := self next) = anObject]]
>> whileFalse: [newStream nextPut: element].
>> ^newStream contents!
>>
>>
More information about the Squeak-dev
mailing list
|