[squeak-dev] Fwd: [Pharo-dev] Random corrupted data when copying from very large byte array

Eliot Miranda eliot.miranda at gmail.com
Fri Jan 19 22:05:30 UTC 2018


---------- Forwarded message ----------
From: Eliot Miranda <eliot.miranda at gmail.com>
Date: Fri, Jan 19, 2018 at 2:04 PM
Subject: Re: [Pharo-dev] Random corrupted data when copying from very large
byte array
To: Pharo Development List <pharo-dev at lists.pharo.org>


Hi Alistair, Hi Clément,

On Fri, Jan 19, 2018 at 12:53 PM, Alistair Grant <akgrant0710 at gmail.com>
wrote:

> Hi Clément,
>
> On 19 January 2018 at 17:21, Alistair Grant <akgrant0710 at gmail.com> wrote:
> > Hi Clément,
> >
> > On 19 January 2018 at 17:04, Clément Bera <bera.clement at gmail.com>
> wrote:
> >> Does not seem to be related to prim 105.
> >>
> >> I am confused. Does the size of the array have any impact at all?
> >
> > Yes, I tried reducing the size of the array by a factor of 10 and
> > wasn't able to reproduce the problem at all.
> >
> > With the full size array it failed over half the time (32 bit).
> >
> > I ran the test about 180 times on 64 bit and didn't get a single failure.
> >
> >> It seems the
> >> problem shows up from the first copy of 16k elements.
> >>
> >> I can't really reproduce the bug - it happened once but I cannot do it
> >> again. Does the bug happen with the StackVM/PharoS VM? You can find the
> >> 32-bit versions here: http://files.pharo.org/vm/pharoS-spur32/ The
> >> StackVM/PharoS VM is the VM without the JIT. Since the bug is unreliable,
> >> it may be that it happens only in jitted code, so trying that out may be
> >> worth it.
> >
> > I'll try and have a look at this over the weekend.
>
> This didn't fail once in 55 runs.
>
> OS: Ubuntu 16.04
> Image: Pharo 6.0   Latest update: #60528
> VM:
> 5.0 #1 Wed Oct 12 15:48:53 CEST 2016 gcc 4.6.3 [Production Spur ITHB VM]
> StackInterpreter VMMaker.oscog-EstebanLorenzano.1881 uuid:
> ed616067-a57c-409b-bfb6-dab51f058235 Oct 12 2016
> https://github.com/pharo-project/pharo-vm.git Commit:
> 01a03276a2e2b243cd4a7d3427ba541f835c07d3 Date: 2016-10-12 14:31:09
> +0200 By: Esteban Lorenzano <estebanlm at gmail.com> Jenkins build #606
> Linux pharo-linux 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7
> 16:39:45 UTC 2012 i686 i686 i386 GNU/Linux
> plugin path: /home/alistair/pharo7/Issue20982/bin/ [default:
> /home/alistair/pharo7/Issue20982/bin/]
>
>
> I then went back and attempted to reproduce the failures in my regular
> 32 bit image, but only got 1 corruption in 10 runs.  I've been working
> in this image without restarting for most of the day.
>
> Quitting out and restarting the image and then running the corruption
> check resulted in 11 corruptions from 11 runs.
>
>
> Image: Pharo 7.0 Build information:
> Pharo-7.0+alpha.build.425.sha.eb0a6fb140ac4a42b1f158ed37717e0650f778b4
> (32 Bit)
> VM:
> 5.0-201801110739  Thursday 11 January  09:30:12 CET 2018 gcc 4.8.5
> [Production Spur VM]
> CoInterpreter VMMaker.oscog-eem.2302 uuid:
> 55ec8f63-cdbe-4e79-8f22-48fdea585b88 Jan 11 2018
> StackToRegisterMappingCogit VMMaker.oscog-eem.2302 uuid:
> 55ec8f63-cdbe-4e79-8f22-48fdea585b88 Jan 11 2018
> VM: 201801110739
> alistair at alistair-xps13:snap/pharo-snap/pharo-vm/opensmalltalk-vm $
> Date: Wed Jan 10 23:39:30 2018 -0800 $
> Plugins: 201801110739
> alistair at alistair-xps13:snap/pharo-snap/pharo-vm/opensmalltalk-vm $
> Linux b07d7880072c 4.13.0-26-generic #29~16.04.2-Ubuntu SMP Tue Jan 9
> 22:00:44 UTC 2018 i686 i686 i686 GNU/Linux
> plugin path: /snap/core/3748/lib/i386-linux-gnu/ [default:
> /snap/core/3748/lib/i386-linux-gnu/]
>
>
> So, as well as restarting the image before running the test, just
> wondering if the gcc compiler version could have an impact?
>

I suspect that the problem is the same compactor bug I've been trying to
reproduce all week, and have just fixed.  Could you try and reproduce with
a VM built from the latest commit?

Some details:
The SpurPlanningCompactor works by using the fact that all Spur objects
have room for a forwarding pointer.  The compactor makes three passes:

- the first pass through memory works out where each object will go, saving
its first field in a buffer (savedFirstFieldsSpace) and replacing that field
with a forwarding pointer to the object's new location
- the second pass scans all pointer objects, replacing their fields with the
new locations of the objects they reference (by following the forwarding
pointers), and also relocates any pointer fields in savedFirstFieldsSpace
- the final pass slides objects down, restoring their relocated first fields
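
To make this concrete, here is a toy model of the three passes (this is not
the VMMaker code, just a sketch: the heap is modelled as a Dictionary from
addresses to arrays of slots, and the names heap, live and savedFirstFields
are invented):

    | heap live savedFirstFields nextFree newHeap |
    heap := Dictionary new.
    heap at: 16 put: { #liveA. 48 }.     "slot 2 is a pointer field referring to the object at 48"
    heap at: 32 put: { #dead. nil }.     "unreachable; compaction reclaims it"
    heap at: 48 put: { #liveB. 16 }.     "pointer field referring back to 16"
    live := #(16 48).                    "pretend the marker found these alive"

    "Pass 1: work out where each live object will go; save its first field and
     overwrite that field with the forwarding (destination) address."
    savedFirstFields := OrderedCollection new.
    nextFree := 16.
    live do: [ :addr | | obj |
        obj := heap at: addr.
        savedFirstFields add: obj first.
        obj at: 1 put: nextFree.
        nextFree := nextFree + 16 ].     "each toy object occupies 16 bytes"

    "Pass 2: remap pointer fields (and any pointers among the saved first
     fields) by following the forwarding addresses now sitting in the
     referents' first slots."
    live do: [ :addr | | obj |
        obj := heap at: addr.
        2 to: obj size do: [ :i |
            (live includes: (obj at: i))
                ifTrue: [ obj at: i put: (heap at: (obj at: i)) first ] ] ].
    savedFirstFields := savedFirstFields collect: [ :f |
        (live includes: f) ifTrue: [ (heap at: f) first ] ifFalse: [ f ] ].

    "Pass 3: slide objects down to their destinations, restoring first fields."
    newHeap := Dictionary new.
    live withIndexDo: [ :addr :i | | obj |
        obj := (heap at: addr) copy.
        obj at: 1 put: (savedFirstFields at: i).
        newHeap at: (heap at: addr) first put: obj ].
    newHeap                              "-> 16 -> {#liveA. 32}, 32 -> {#liveB. 16}"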

The size of the buffer used for savedFirstFieldsSpace determines how many
passes are needed.  The system uses either eden (which is empty when
compaction occurs), a large free chunk, or a newly allocated segment,
whichever yields the largest space.  So in the right circumstances eden will
be used, and if it is too small to hold every saved first field, more than
one pass is required.
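
As a rough, back-of-the-envelope illustration (the numbers here are
invented), the number of passes is about the total size of the saved first
fields, one word per moved object, divided by the size of the chosen space:

    | movedObjects wordSize spaceBytes passes |
    movedObjects := 5000000.            "invented figure"
    wordSize := 4.                      "32-bit image; 8 on 64 bits"
    spaceBytes := 8 * 1024 * 1024.      "say eden was the largest candidate"
    passes := (movedObjects * wordSize / spaceBytes) ceiling.   "-> 3 in this made-up case"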

The bug was that when multiple passes are used the compactor forgot to
unmark the corpse left behind when an object was moved.  Instead of the
corpse being turned into free space it was retained, but its first field
held the forwarding pointer to its new location rather than the actual
first field.  So on 32 bits a ByteArray that should have been collected
would have its first 4 bytes appear to be invalid, and on 64 bits its first
8 bytes.  Because the heap on 64 bits can grow larger, the bug may show
itself much less frequently there than on 32 bits.  When compaction can be
completed in a single pass all corpses are correctly collected, so most of
the time the bug is hidden.
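
For instance (an illustrative snippet only, with an invented address), this
is roughly what the symptom looks like on a 32-bit, little-endian system:
the corpse's first word holds the forwarding pointer instead of the original
data, so the first 4 bytes of the ByteArray read back as the pointer's bytes:

    | original pointer corrupted |
    original := #[80 75 3 4 10 0 0 0].   "expected start of the data: the zip signature 'PK...'"
    pointer := 16r1F3A0CF8.              "invented forwarding address"
    corrupted := original copy.
    1 to: 4 do: [ :i |
        corrupted at: i put: ((pointer bitShift: (i - 1) * -8) bitAnd: 16rFF) ].
    corrupted                            "first 4 bytes now show the pointer bytes; the rest is intact"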

This is the commit:
commit 0fe1e1ea108e53501a0e728736048062c83a66ce
Author: Eliot Miranda <eliot.miranda at gmail.com>
Date:   Fri Jan 19 13:17:57 2018 -0800

    CogVM source as per VMMaker.oscog-eem.2320

    Spur:
    Fix a bad bug in SpurPlanningCompactor.  unmarkObjectsFromFirstFreeObject,
    used when the compactor requires more than one pass due to insufficient
    savedFirstFieldsSpace, expects the corpse of a moved object to be unmarked,
    but copyAndUnmarkObject:to:bytes:firstField: only unmarked the target.
    Unmarking the corpse before the copy unmarks both.  This fixes a crash with
    ReleaseBuilder class>>saveAsNewRelease when non-use of cacheDuring: creates
    lots of files, enough to push the system into the multi-pass regime.


>
> HTH,
> Alistair
>
>
>
> > Cheers,
> > Alistair
> >
> >
> >
> >> On Thu, Jan 18, 2018 at 7:12 PM, Clément Bera <bera.clement at gmail.com>
> >> wrote:
> >>>
> >>> I would suspect a bug in primitive 105 on byte objects (it was changed
> >>> recently in the VM), called by copyFrom: 1 to: readCount. The bug would
> >>> likely be due to specific alignment in readCount or something like that.
> >>> (Assuming you're in 32 bits since the 4 bytes are corrupted.)
> >>>
> >>> When I get better I can have a look (I am currently quite sick).
> >>>
> >>> On Thu, Jan 18, 2018 at 4:51 PM, Thierry Goubier
> >>> <thierry.goubier at gmail.com> wrote:
> >>>>
> >>>> Hi Cyril,
> >>>>
> >>>> try with the latest VMs available at:
> >>>>
> >>>> https://bintray.com/opensmalltalk/vm/cog/
> >>>>
> >>>> For example, the latest Ubuntu 64-bit VM is at:
> >>>>
> >>>> https://bintray.com/opensmalltalk/vm/cog/201801170946#files
> >>>>
> >>>> Regards,
> >>>>
> >>>> Thierry
> >>>>
> >>>> 2018-01-18 16:42 GMT+01:00 Cyrille Delaunay <cy.delaunay at gmail.com>:
> >>>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> I just added a new bug entry for an issue we have been experiencing for
> >>>>> some time:
> >>>>>
> >>>>>
> >>>>> https://pharo.fogbugz.com/f/cases/20982/Random-corrupted-data-when-copying-from-very-large-byte-array
> >>>>>
> >>>>> Here is the description:
> >>>>>
> >>>>>
> >>>>> History:
> >>>>>
> >>>>> This issue was spotted after experiencing strange behavior with a
> >>>>> Seaside upload.
> >>>>> After uploading a big file from a web browser, the file object
> >>>>> generated within the Pharo image begins with 4 unexpected bytes.
> >>>>> This issue occurs randomly: sometimes the first 4 bytes are right.
> >>>>> Sometimes the first 4 bytes are wrong.
> >>>>> This issue only occurs with Pharo 6.
> >>>>> This issue occurs for all platforms (Mac, Ubuntu, Windows)
> >>>>>
> >>>>> Steps to reproduce:
> >>>>>
> >>>>> I have been able to set up a small scenario that highlights the issue.
> >>>>>
> >>>>> Download Pharo 6.1 on my Mac (Sierra 10.12.6):
> >>>>> https://pharo.org/web/download
> >>>>> Then iterate over this process until you spot the issue:
> >>>>>
> >>>>> => start the pharo image
> >>>>> => execute this piece of code in a playground
> >>>>>
> >>>>> ZnServer startDefaultOn: 1701.
> >>>>> ZnServer default maximumEntitySize: 80* 1024 * 1024.
> >>>>> '/Users/cdelaunay/myzip.zip' asFileReference writeStreamDo: [ :out |
> >>>>>     out binary; nextPutAll: #[80 75 3 4 10 0 0 0 0 0 125 83 67 73 0 0 0 0 0 0].
> >>>>>     18202065 timesRepeat: [ out nextPut: 0 ]
> >>>>> ].
> >>>>>
> >>>>> => Open a web browser page on: http://localhost:1701/form-test-3
> >>>>> => Upload the zip file previously generated ('myzip.zip').
> >>>>> => If the web page displays: "contents=000000000a00..." (or anything
> >>>>> unexpected), THIS IS THE ISSUE !
> >>>>> => If the web page displays: "contents=504b03040a00..", the upload
> >>>>> worked fine. You can close the image (without saving)
> >>>>>
> >>>>>
> >>>>>
> >>>>> Debugging:
> >>>>>
> >>>>>
> >>>>>
> >>>>> Bob Arning has been able to reproduce the issue with my scenario.
> >>>>> He dived into the code involved in this process, till reaching some
> >>>>> "basic" methods where he saw the issue occurring.
> >>>>>
> >>>>> Here are the conclusions so far:
> >>>>> => A corruption occurs while reading an input stream with the method
> >>>>> ZnUtils class>>readUpToEnd:limit:
> >>>>> The first 4 bytes may be altered randomly.
> >>>>> => The first 4 bytes are initially correctly written to an outputStream.
> >>>>> But the first 4 bytes of this outputStream get altered (corrupted),
> >>>>> sometimes when the inner byte array grows OR when performing the final
> >>>>> "outputStream contents".
> >>>>> => Here is a piece of code that reproduces the issue (still randomly;
> >>>>> stopping and restarting the image may change the behavior)
> >>>>>
> >>>>> test4"self test4"    | species bufferSize buffer totalRead
> outputStream
> >>>>> answer inputStream ba byte1 |            ba := ByteArray new:
> 18202085.
> >>>>> ba atAllPut: 99.        1 to: 20 do: [  :i | ba at: i put: (#[80 75
> 3 4 10 7
> >>>>> 7 7 7 7 125 83 67 73 7 7 7 7 7 7] at: i) ].    inputStream := ba
> readStream.
> >>>>> bufferSize := 16384.    species := ByteArray.
> >>>>>     buffer := species new: bufferSize.
> >>>>>     totalRead := 0.
> >>>>>     outputStream := nil.
> >>>>>     [ inputStream atEnd ] whileFalse: [ | readCount |
> >>>>>         readCount := inputStream readInto: buffer startingAt: 1
> count:
> >>>>> bufferSize.
> >>>>>         totalRead = 0 ifTrue: [
> >>>>>             byte1 := buffer first.
> >>>>>         ].
> >>>>>         totalRead := totalRead + readCount.
> >>>>>
> >>>>>         outputStream ifNil: [
> >>>>>             inputStream atEnd
> >>>>>                 ifTrue: [ ^ buffer copyFrom: 1 to: readCount ]
> >>>>>                 ifFalse: [ outputStream := (species new: bufferSize)
> >>>>> writeStream ] ].
> >>>>>         outputStream next: readCount putAll: buffer startingAt: 1.
> >>>>>         byte1 = outputStream contents first ifFalse: [ self halt ].
> >>>>>     ].
> >>>>>     answer := outputStream ifNil: [ species new ] ifNotNil: [
> >>>>> outputStream contents ].
> >>>>>     byte1 = answer first ifFalse: [ self halt ].    ^answer
> >>>>>
> >>>>>
> >>>>>
> >>>>> Suspicions:
> >>>>>
> >>>>> This issue appeared with Pharo 6.
> >>>>>
> >>>>> Some people suggested that it could be a VM issue, and to try my little
> >>>>> scenario with the latest available VM.
> >>>>>
> >>>>> I am not sure where to find the latest available VM.
> >>>>>
> >>>>> I did the test using these elements:
> >>>>>
> >>>>> https://files.pharo.org/image/60/latest.zip
> >>>>>
> >>>>> https://files.pharo.org/get-files/70/pharo-mac-latest.zip/
> >>>>>
> >>>>>
> >>>>>
> >>>>> The issue is still present
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Cyrille Delaunay
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Clément Béra
> >>> Pharo consortium engineer
> >>> https://clementbera.wordpress.com/
> >>> Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
> >>
> >>
> >>
> >>
> >> --
> >> Clément Béra
> >> Pharo consortium engineer
> >> https://clementbera.wordpress.com/
> >> Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
>
>


-- 
_,,,^..^,,,_
best, Eliot


