[squeak-dev] MultiByteFileStream upToAll: strange bug

Bernhard Pieber bernhard at pieber.com
Sat Jan 20 21:24:01 UTC 2018


Hi everyone,

I think I have found a really strange bug in MultiByteFileStream. I am on macOS Sierra, using the latest VM from bintray and an updated trunk image.

I am trying to parse anchors from a UTF-8 encoded HTML file (see attachment). The code reads the file through a MultiByteFileStream with a UTF8TextConverter.

Here is the code that shows the bug:

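FileStream readOnlyFileNamed: 'test.html' do: [:stream | 
	| result |
	result := OrderedCollection new.
	[stream atEnd] whileFalse: [
		"Skip past the opening of the next anchor tag."
		stream match: '<A HREF="'.
		"Collect everything up to the closing tag."
		result add: (stream upToAll: '</A>')].
	"Answer the 13th parsed anchor."
	result at: 13
].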

It answers the following string:
'https://www.europa.de/produkte/lebensversicherung">Darlehen sichern: Variable Risiko-Lebensversicherung</A>
				<DT><A HREF="http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
			</DL><p>
		</DL><p>
	</DL><p>
</HTML>
'

You can see that it did not stop at the </A> as it should have, but instead answered the rest of the file. The strange thing is that the next anchor looks like this:
'http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>
			</DL><p>
		</DL><p>
	</DL><p>
</HTML>
'
So it read part of the file again.

I tried making the file smaller, but then the bug goes away.

As a cross-check, when I read the whole file into memory at once and parse that, it parses correctly:

FileStream readOnlyFileNamed: 'test.html' do: [:fileStream | 
	| stream result |
	stream := fileStream contentsOfEntireFile readStream.
	result := OrderedCollection new.
	[stream atEnd] whileFalse: [
		stream match: '<A HREF="'.
		result add: (stream upToAll: '</A>')].
	result at: 13
].
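
The two snippets above can be combined to locate the first anchor at which the file-backed stream and the in-memory stream disagree. This is only a sketch, not something that has been run against the attachment: the names parse, fromFile and fromMemory are made up for illustration, and it assumes that readOnlyFileNamed:do: answers the value of its block, as the snippets above already rely on.

```smalltalk
| parse fromFile fromMemory |
"Hypothetical helper block: the same parsing loop as above, answering all anchors."
parse := [:stream | | result |
	result := OrderedCollection new.
	[stream atEnd] whileFalse: [
		stream match: '<A HREF="'.
		result add: (stream upToAll: '</A>')].
	result].

"Parse directly from the file stream."
fromFile := FileStream readOnlyFileNamed: 'test.html' do: [:stream |
	parse value: stream].

"Parse from a read stream over the complete file contents."
fromMemory := FileStream readOnlyFileNamed: 'test.html' do: [:stream |
	parse value: stream contentsOfEntireFile readStream].

"Answer the index of the first differing anchor, or nil if the common prefix agrees."
(1 to: (fromFile size min: fromMemory size))
	detect: [:i | (fromFile at: i) ~= (fromMemory at: i)]
	ifNone: [nil]
```

If this answers an index, inspecting both collections at that index shows exactly where upToAll: on the file stream starts to go wrong. If it answers nil, the two collections may still differ in size.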

Any ideas, anyone?

Bernhard

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/squeak-dev/attachments/20180120/040e869e/attachment.html>

