[squeak-dev] MultiByteFileStream upToAll: strange bug

Bob Arning arning315 at comcast.net
Sun Jan 21 03:00:47 UTC 2018


The problem occurs when a read crosses the end of the 2 KB buffer held in 
the StandardFileStream variable <collection>. The </A> you were looking 
for straddled that boundary. Here is a simple test:

============

test1
"
self test1
"
     | f answer result fn |

     "Start from a fresh file so earlier runs cannot affect the result."
     fn := 'foo.foo.foo'.
     FileDirectory default deleteFileNamed: fn.
     "Write four records: 1000 a's followed by the delimiter 'bbb'."
     f := MultiByteFileStream fileNamed: fn.
     {1000. 1000. 1000. 1000} do: [ :len |
         len timesRepeat: [f nextPutAll: 'a'].
         f nextPutAll: 'bbb'.
     ].
     f close.
     "Read the records back, noting each answer's size and the position after it."
     result := OrderedCollection new.
     f := MultiByteFileStream fileNamed: fn.
     [f atEnd] whileFalse: [
         answer := f upToAll: 'bbb'.
         result add: {answer size. f position. "answer"}
     ].
     f close.

     ^result

=========

- write 1000 a's followed by 3 b's

- do this 4 times

- read it back by using upToAll: 'bbb'

- expect 4 1000-byte strings as the result

BUT you get

an OrderedCollection(
#(1000 1003)
#(1000 2006)
#(2006 3009)
#(1003 4012))

instead. The positions are right, but the lengths returned are not.
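
For comparison, here is a sketch of the same test run against an in-memory 
stream instead of a file, along the lines of the cross check in the quoted 
message. The selector test1InMemory is made up for illustration. If the 
in-memory stream behaves as it did in that cross check, every entry should 
report a size of 1000, which would point at the file stream's buffering 
rather than at the test data.

```smalltalk
test1InMemory
"
self test1InMemory
"
     | ws stream answer result |

     "Build the same data in memory: 4 x (1000 a's followed by 'bbb')."
     ws := WriteStream on: String new.
     4 timesRepeat: [
         1000 timesRepeat: [ws nextPutAll: 'a'].
         ws nextPutAll: 'bbb'.
     ].
     "Read it back through a plain ReadStream, so no file buffer is involved."
     stream := ws contents readStream.
     result := OrderedCollection new.
     [stream atEnd] whileFalse: [
         answer := stream upToAll: 'bbb'.
         result add: {answer size. stream position}
     ].

     ^result
```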

On 1/20/18 4:24 PM, Bernhard Pieber wrote:
> Hi everyone,
>
> I think I found a really strange bug in MultiByteFileStream. I am on macOS Sierra and used the latest VM from bintray and an updated trunk image.
>
> I am trying to parse anchors from a UTF-8 encoded HTML file (see attachment). It uses a MultiByteFileStream with a UTF8TextConverter.
>
> Here is the code that shows the bug:
>
> FileStream readOnlyFileNamed: 'test.html' do: [:stream |
> 	| result |
> 	result := OrderedCollection new.
> 	[stream atEnd] whileFalse: [
> 		stream match: '<A HREF="'.
> 		result add: (stream upToAll: '</A>')].
> 	result at: 13
> ].
>
> It answers the following string:
> 'https://www.europa.de/produkte/lebensversicherung">Darlehen sichern: Variable Risiko-Lebensversicherung</A>
> 				<DT><A HREF="http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
> 			</DL><p>
> 		</DL><p>
> 	</DL><p>
> </HTML>
> '
>
> You can see that it did not stop at the </A> as it should have but answers the rest of the file. The strange thing is that the next anchor looks like this:
> 'http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
> 			</DL><p>
> 		</DL><p>
> 	</DL><p>
> </HTML>
> '
> So it read part of the file again.
>
> I tried making the file smaller but the bug goes away then.
>
> As a cross-check, when I read the whole file at once, it parses correctly.
>
> FileStream readOnlyFileNamed: 'test.html' do: [:fileStream |
> 	| stream result |
> 	stream := fileStream contentsOfEntireFile readStream.
> 	result := OrderedCollection new.
> 	[stream atEnd] whileFalse: [
> 		stream match: '<A HREF="'.
> 		result add: (stream upToAll: '</A>')].
> 	result at: 13
> ].
>
> Any ideas anyone?
>
> Bernhard
