<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html;

      charset=windows-1252">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p><font face="Georgia">The problem occurs when crossing the 2 KB
        size of the StandardFileStream variable &lt;collection&gt;. The
        &lt;/A&gt; you were looking for straddled that boundary. Here is
        a simple test:</font></p>

    ============<br>

    <p><font face="Georgia">test1<br>

        "<br>

        self test1 <br>

        "<br>

            | f answer result fn |<br>

            <br>

            fn := 'foo.foo.foo'.<br>

            FileDirectory default deleteFileNamed: fn.<br>

            f := MultiByteFileStream fileNamed: fn.<br>

            {1000. 1000. 1000. 1000} do: [ :len |<br>

                len timesRepeat: [f nextPutAll: 'a'].<br>

                f nextPutAll: 'bbb'.<br>

            ].<br>

            f close.<br>

            result := OrderedCollection new.<br>

            f := MultiByteFileStream fileNamed: fn.<br>

            [f atEnd] whileFalse: [<br>

                answer := f upToAll: 'bbb'.<br>

                result add: {answer size. f position. "answer"}<br>

            ].<br>

            f close.<br>

            <br>

            ^result    </font><br>

    </p>

    =========<br>

    <p><font face="Georgia">- write 1000 a's followed by 3 b's</font></p>

    <p><font face="Georgia">- do this 4 times</font></p>

    <p><font face="Georgia">- read it back by using upToAll: 'bbb'</font></p>

    <p><font face="Georgia">- expect 4 1000-byte strings as the result</font></p>

    BUT you get<br>

    <br>

    an OrderedCollection(<br>

    #(1000 1003) <br>

    #(1000 2006) <br>

    #(2006 3009) <br>

    #(1003 4012))<br>

    <br>

    instead. The positions are right, but the lengths returned are not.<br>
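    <p>For comparison, here is a minimal workspace sketch of the same read loop over an in-memory ReadStream, in the spirit of the cross-check in the quoted message below. It is only an illustration under the assumption that the in-memory stream is not affected; the variable names are made up for this example.</p>

```smalltalk
"Build the same data in memory: 4 x (1000 a's followed by 'bbb')."
| data stream result answer |
data := String streamContents: [:out |
	4 timesRepeat: [
		1000 timesRepeat: [out nextPut: $a].
		out nextPutAll: 'bbb']].

"Read it back with upToAll:, recording chunk size and stream position."
stream := ReadStream on: data.
result := OrderedCollection new.
[stream atEnd] whileFalse: [
	answer := stream upToAll: 'bbb'.
	result add: {answer size. stream position}].
result
```

    <p>Going by the expectation stated above, every entry should report a size of 1000, with positions 1003, 2006, 3009 and 4012. Assuming the 2 KB limit means 2048 bytes, the third chunk (positions 2006 to 3009) is the first one that spans it, which is where the file-based test starts returning wrong lengths.</p>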

    <br>

    <div class="moz-cite-prefix">On 1/20/18 4:24 PM, Bernhard Pieber

      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:645F0CC4-4ADA-4328-92D1-C5E29B326056@pieber.com">

      <pre wrap="">Hi everyone,

I think I found a really strange bug in MultiByteFileStream. I am on macOS Sierra and used the latest VM from bintray and an updated trunk image.

I try to parse anchors from a UTF-8 encoded HTML file (see attachment). It uses a MultiByteFileStream with a UTF8TextConverter.

Here is the code that shows the bug:

FileStream readOnlyFileNamed: 'test.html' do: [:stream | 

        | result |

        result := OrderedCollection new.

        [stream atEnd] whileFalse: [

                stream match: '<A HREF="'.

                result add: (stream upToAll: '</A>')].

        result at: 13

].

It answers the following string:

'https://www.europa.de/produkte/lebensversicherung">Darlehen sichern: Variable Risiko-Lebensversicherung</A>

                                <DT><A HREF="http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>

                        </DL><p>

                </DL><p>

        </DL><p>

</HTML>

'

You can see that it did not stop at the </A> as it should have but answers the rest of the file. The strange thing is that the next anchor looks like this:

'http://orf.at/stories/2358210/2358209/">Banken im Zinsdilemma</A>

                        </DL><p>

                </DL><p>

        </DL><p>

</HTML>

'

So it read part of the file again.

I tried making the file smaller but the bug goes away then.

As a cross check when I read the whole file at once it parses correctly.

FileStream readOnlyFileNamed: 'test.html' do: [:fileStream | 

        | stream result |

        stream := fileStream contentsOfEntireFile readStream.

        result := OrderedCollection new.

        [stream atEnd] whileFalse: [

                stream match: '<A HREF="'.

                result add: (stream upToAll: '</A>')].

        result at: 13

].

Any ideas anyone?

Bernhard

</pre>


    </blockquote>

    <br>

  </body>

</html>