<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p><font face="Georgia">The problem occurs when crossing the 2 KB
size of the StandardFileStream variable &lt;collection&gt;. The
&lt;/A&gt; you were looking for straddled that boundary. Here is
a simple test:</font></p>
============<br>
<p><font face="Georgia">test1<br>
"<br>
self test1 <br>
"<br>
| f answer result fn |<br>
<br>
fn := 'foo.foo.foo'.<br>
FileDirectory default deleteFileNamed: fn.<br>
f := MultiByteFileStream fileNamed: fn.<br>
"Write four runs of 1000 a's, each followed by the delimiter 'bbb' (4012 bytes in total)."<br>
{1000. 1000. 1000. 1000} do: [ :len |<br>
len timesRepeat: [f nextPutAll: 'a'].<br>
f nextPutAll: 'bbb'.<br>
].<br>
f close.<br>
result := OrderedCollection new.<br>
f := MultiByteFileStream fileNamed: fn.<br>
"Read back one delimiter-terminated chunk at a time, recording its size and the stream position after it."<br>
[f atEnd] whileFalse: [<br>
answer := f upToAll: 'bbb'.<br>
result add: {answer size. f position. "answer"}<br>
].<br>
f close.<br>
<br>
^result </font><br>
</p>
=========<br>
<p><font face="Georgia">- write 1000 a's followed by 3 b's</font></p>
<p><font face="Georgia">- do this 4 times</font></p>
<p><font face="Georgia">- read it back by using upToAll: 'bbb'</font></p>
<p><font face="Georgia">- expect 4 1000-byte strings as the result</font></p>
BUT you get<br>
<br>
an OrderedCollection(<br>
#(1000 1003) <br>
#(1000 2006) <br>
#(2006 3009) <br>
#(1003 4012))<br>
<br>
instead. The positions are right, but the lengths returned are not.<br>
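<p><font face="Georgia">For comparison, here is a minimal sketch of the same read loop over an in-memory ReadStream, in the spirit of Bernhard's cross check quoted below. The selector name test2 is made up for illustration, and I am assuming the in-memory stream is not affected by the file buffer boundary. If that holds, every chunk should come back with size 1000:</font></p>
<pre>test2
	"self test2"
	| data stream answer result |
	"Build the same 4012 characters in memory instead of in a file."
	data := String streamContents: [:out |
		4 timesRepeat: [
			1000 timesRepeat: [out nextPut: $a].
			out nextPutAll: 'bbb']].
	result := OrderedCollection new.
	stream := ReadStream on: data.
	"Same loop as test1, but on the in-memory stream."
	[stream atEnd] whileFalse: [
		answer := stream upToAll: 'bbb'.
		result add: {answer size. stream position}].
	^result</pre>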
<br>
<div class="moz-cite-prefix">On 1/20/18 4:24 PM, Bernhard Pieber
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:645F0CC4-4ADA-4328-92D1-C5E29B326056@pieber.com">
<pre wrap="">Hi everyone,
I think I found a really strange bug in MultiByteFileStream. I am on macOS Sierra and used the latest VM from bintray and an updated trunk image.
I try to parse anchors from a UTF-8 encoded HTML file (see attachment). It uses a MultiByteFileStream with a UTF8TextConverter.
Here is the code that shows the bug:
FileStream readOnlyFileNamed: 'test.html' do: [:stream |
| result |
result := OrderedCollection new.
[stream atEnd] whileFalse: [
stream match: '&lt;A HREF="'.
result add: (stream upToAll: '&lt;/A&gt;')].
result at: 13
].
It answers the following string:
'<a class="moz-txt-link-freetext" href="https://www.europa.de/produkte/lebensversicherung">https://www.europa.de/produkte/lebensversicherung</a>"&gt;Darlehen sichern: Variable Risiko-Lebensversicherung&lt;/A&gt;
&lt;DT&gt;&lt;A HREF=<a class="moz-txt-link-rfc2396E" href="http://orf.at/stories/2358210/2358209/">"http://orf.at/stories/2358210/2358209/"</a>&gt;Banken im Zinsdilemma&lt;/A&gt;
&lt;/DL&gt;&lt;p&gt;
&lt;/DL&gt;&lt;p&gt;
&lt;/DL&gt;&lt;p&gt;
&lt;/HTML&gt;
'
You can see that it did not stop at the &lt;/A&gt; as it should have but answers the rest of the file. The strange thing is that the next anchor looks like this:
'<a class="moz-txt-link-freetext" href="http://orf.at/stories/2358210/2358209/">http://orf.at/stories/2358210/2358209/</a>"&gt;Banken im Zinsdilemma&lt;/A&gt;
&lt;/DL&gt;&lt;p&gt;
&lt;/DL&gt;&lt;p&gt;
&lt;/DL&gt;&lt;p&gt;
&lt;/HTML&gt;
'
So it read part of the file again.
I tried making the file smaller but the bug goes away then.
As a cross check when I read the whole file at once it parses correctly.
FileStream readOnlyFileNamed: 'test.html' do: [:fileStream |
| stream result |
stream := fileStream contentsOfEntireFile readStream.
result := OrderedCollection new.
[stream atEnd] whileFalse: [
stream match: '&lt;A HREF="'.
result add: (stream upToAll: '&lt;/A&gt;')].
result at: 13
].
Any ideas anyone?
Bernhard
</pre>
</blockquote>
<br>
</body>
</html>