[Vm-dev] Re: vm problem on cog an stack v3

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Wed Mar 30 20:46:21 UTC 2016


2016-03-30 13:46 GMT+02:00 Ben Coman <btc at openinworld.com>:

>
> On Wed, Mar 30, 2016 at 2:09 PM, Eliot Miranda <eliot.miranda at gmail.com>
> wrote:
> >
> > Hi Nicolas,
> >
> > On Tue, Mar 29, 2016 at 2:36 PM, Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com> wrote:
> >>
> >>
> >>
> >>
> >> 2016-03-27 15:39 GMT+02:00 Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com>:
> >>>
> >>> To be more specific,
> >>>
> >>> The first thing to do would be to confirm that the KernelNumber tests
> fail with a squeak.stack.v3 VM compiled with head revision of COG VM
> branch, whatever OS.
> >>>
> >>> Then knowing from which SVN version of COG VM branch the KernelNumber
> the tests start failing would be nice.
> >>>
> >>> The bisect job is to:
> >>> - iterate on version number (whatever strategy, bsearch or something)
> >>> - checkout VM sources
> >>> - compile the build.favouriteOS/squeak.stack.v3
> >>> - run a v3 image with the generated VM and launch KernelNumber tests
> >>>
> >>> Really a job for a jenkins bot, travis bot or whatever...
> >>> The next good thing would be to give a little love to build.squeak.org
> or anyother similar solution.
> >>> I only see red disks on this site...
> >>>
> >>
> >> Follow up: I did bissect manually:
> >> 3648 (VMMaker.oscog-eem.nice.1729) OK
> >> 3649 (VMMaker.oscog-eem.1740) Not OK
> >>
> >> So something went wrong for V3 between these two versions.
> >> At the same time, it works for spur.
> >>
> >> Spur objects are 8 bytes aligned, v3 objects are 4 bytes aligned...
> >> So fetching 64 bits from V3 MUST be decomposed into 2 32 bits fetch to
> avoid a segfault related to misalign fetch.
> >>
> >> OK, let's look how it is performed:
> >>
> >> fetchLong64: longIndex ofObject: oop
> >>     <returnTypeC: #sqLong>
> >>     ^self cppIf: BytesPerWord = 8
> >>         ifTrue: [self long64At: oop + self baseHeaderSize + (longIndex
> << 3)]
> >>         ifFalse:
> >>             [self cppIf: VMBIGENDIAN
> >>                 ifTrue: [((self long32At: oop + self baseHeaderSize +
> (longIndex << 3)) asUnsignedLongLong << 32)
> >>                     + (self long32At: oop + self baseHeaderSize +
> (longIndex << 3 + 4))]
> >>                 ifFalse: [(self long32At: oop + self baseHeaderSize +
> (longIndex << 3))
> >>                     + ((self long32At: oop + self baseHeaderSize +
> (longIndex << 3 + 4)) asUnsignedLongLong << 32)]]
> >>
> >> AH AH! Operation is:
> >>     low + (((unsigned) high) << 32)
> >>
> >> With low declared as SIGNED (long32At is signed GRRR).
> >> What if bit 32 of low is 1? then the compiler will perform:
> >>
> >> (unsigned long) low + ...
> >>
> >> AND THIS WILL DO A SIGN EXTENSION...
> >>
> >> It's not that my code was false...
> >> It's just that it uncovered a bug in fetchLong64:ofObject:
> >>
> >> That cost me many hours, but that enforce my conviction about
> signedness...
> >> I would much much much prefer to call unsignedLong32At because it's
> what I mean 9 times out of 10: get the magnitude...
> >
> >
> > Let's add them.  It would be great to have them all consistent too.  But
> at least let's add unsignedLong32At:[put:].  Wait.  In the simulator
> longAt:[put:] et al are unsigned.  So we have three choices:
> >
> > a) live with it
> > b) make longAt:[put:] long32At:put: et al unsigned in C, and add
> signedLongAt:[put:] et al.
> > c) make the simulator's longAt:[put:] long32At:put: signed and then add
> unsignedLongAt:[put:] et al
> >
> > Any others?
> >
> > I think a) is unwise.  The simulator and C should agree.  Nicolas, I can
> join your experience in finding that debugging these issues is extremely
> time consuming.
> >
> > My preference is for b); some uses where the type should be signed will
> by located by C compiler warnings; we hope that VM tests will catch the
> others)
>
> Wouldn't (b) risk occasional cognitive dysfunction switching between
> the semantic of aa C long being signed and a Slang long being
> unsigned?  (c) seems better long term, but switching the signedness of
> the an existing simulation method might cause short term instability
> and need to be tackled in one sitting?   So a naive question... Why
> leave one implicit?  Consider d) introducing both signedLongAt:[put:]
> and unsignedLongAt:[put:] and gradual migration.
>
> cheers -ben
>
>
My first thought was as Eliot, use b) because
- the simulator is already b)
- wild guess is that there are more unsigned than signed usage

But I like Ben arguments better.
Non uniform naming has great costs
- slow learning ramp for noobs
- seed for future bug due to misunderstanding.

For example we already have
* long implicitely meaning 32 bits like in signedIntToLong
signedIntFromLong,
* long unconditionnally meaning 64 bits like sqLong
* long meaning either 32 or 64 bits depending on architecture in generated
C (like in StackInterperter>>putLong:toFile:)
* long meaning either 32 or 64 depending on the context
  (#longAt: is allways 32 bits when sent to ByteArray Bitmap or CAccessor,
but might be 64 bits when sent to simulator argh...)

I remember I had similar problems with word the first time I looked at
VMMaker.
WordArray being allways 32 bits word, but wordSize being 4 or 8 depending
on target architecture.
Even now, I'm not sure of which word we are talking of...

Also b) goes with a blitz strategy:
- do a massive semantic change long32 -> unsigned
- review the send sites and change a few variant to signed
- repair broken C code once image start crashing and tests start failing
- repair the simulator which might show different symptoms (see
implementors of #asUnsignedLong)
either it works soon, or never...

I've noted only 47 senders of long32At: .
Plus the 33 senders of fetchLong32:ofObject:
And cannot analyze easily the possible side effects thru Slang type
inference...

As my recent bug of LargeInteger division showed: abuse of unsigned lead to
expensive bug tracking too.
That does a bit restrain my enthusiasm for solution b) - chat échaudé
craint l'eau froide ;)

Last, I did not apply b) policy in my VM branch for 32 bits large int.
Instead I introduced the unsigned variant gradually... (coward !)

Before we attempt anything, I'd like to remind the short term goals:
- avoid uncontrolled sign extension when promoting to larger int in C
- avoid undefined behavior, most of which being related to signed arithmetic
- make the simulator better match C (or vice versa in the case)

The last goal is far away, there are subtle "differences"...
Like in the simulator, the difference of two unsigned might be < 0.
In C not, unless we force it (cast, type punning, whatever...).

Last point, gradual introduction and non-regression testing pre-suppose a
stable VM.
I still experiment random crashes I'd like to track first.


> >
> >>
> >>> 2016-03-27 0:40 GMT+01:00 Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com>:
> >>>>
> >>>> Hi,
> >>>> before I continue, i've noticed that the large integer multiply seems
> broken on v3 object memory (cog & stack)
> >>>> Note that this does not happen on Spur.
> >>>>
> >>>> This is independent of my recent changes of LargeIntegers plugin as
> it happens BEFORE these changes and is related to primitive 29 rather than
> to the plugin...
> >>>> Here are the symptoms:
> >>>>
> >>>> halfPower := 10000.
> >>>> s := 111111111111111.
> >>>> head := s quo: halfPower.
> >>>> tail := s - (head * halfPower).
> >>>> {
> >>>>    head as: ByteArray.
> >>>>    (1 to: halfPower digitLength) collect: [:i | halfPower digitAt: i]
> as: ByteArray.
> >>>>    (head*halfPower) as: ByteArray.
> >>>> }.
> >>>>
> >>>> the correct result is:
> >>>>  #(#[199 25 70 150 2] #[16 39] #[112 237 78 18 14 101])
> >>>>
> >>>> the wrong result I obtained with SVN revision 3651 compiled by myself
> is:
> >>>>  #(#[199 25 70 150 2] #[16 39] #[112 237 78 18 254 61])
> >>>>
> >>>> The most significant bits (above 32) are wrong...
> >>>> The pattern I obtain is (with most significant bit put back left)
> >>>>
> >>>> 2r00111101 << 8 +  2r11111110   "wrong result"
> >>>> 2r01100101 << 8 + 2r00001110  "Correct result"
> >>>>
> >>>> I completely fail to infer what's going on from this pattern...
> >>>>
> >>>> This is on MacOSX clang --version
> >>>> Apple LLVM version 7.3.0 (clang-703.0.29)
> >>>> Target: x86_64-apple-darwin15.4.0
> >>>>
> >>>> This goes thru primitiveMultiplyLargeIntegers (29)
> >>>> oopResult := self magnitude64BitIntegerFor: result neg: aIsNegative
> ~= bIsNegative.
> >>>> -> sz > 4
> >>>>                 ifTrue: [objectMemory storeLong64: 0 ofObject:
> newLargeInteger withValue: magnitude]
> >>>> (which I changed recently)
> >>>>
> >>>> then:
> >>>> storeLong64: longIndex ofObject: oop withValue: value
> >>>>     <var: #value type: #sqLong>
> >>>>     self flag: #endianness.
> >>>>     self long32At: oop + self baseHeaderSize + (longIndex << 3) put:
> (self cCode: [value] inSmalltalk: [value bitAnd: 16rFFFFFFFF]);
> >>>>         long32At: oop + self baseHeaderSize + (longIndex << 3) + 4
> put: (value >> 32).
> >>>>     ^value
> >>>>
> >>>> I don't see anything wrong with this code...
> >>>> Well, using a shift on signed value is not that good, but it works
> for at least 3 reasons:
> >>>> - we throw the signBit extension away
> >>>> - slang inlining misses the signedness difference, and the generated
> C code is correct.
> >>>> - Anyway, in our case, the sign bit was 0...
> >>>>
> >>>> Previous implementation in magnitude64BitIntegerFor:neg: was:
> >>>> sz > 4 ifTrue:
> >>>>                 [objectMemory storeLong32: 1 ofObject:
> newLargeInteger withValue: magnitude >> 32].
> >>>>             objectMemory
> >>>>                 storeLong32: 0
> >>>>                 ofObject: newLargeInteger
> >>>>                 withValue: (self cCode: [magnitude] inSmalltalk:
> [magnitude bitAnd: 16rFFFFFFFF])
> >>>>
> >>>> Not much different, except that the high 32 bits and low 32 bits are
> written in a different order...
> >>>>
> >>>> If I had a server I'd like to bisect
> >>>> - from which version does this happens
> >>>> - for which OS
> >>>> - for which compiler
> >>>>
> >>>> Without such information, I think I'll have to debug it either thru
> simulator or directly in gdb, but I feel like I'm really losing my time :(
> >>>>
> >>>> And I've got a 2nd problem like this one...
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
> >
> >
> > --
> > _,,,^..^,,,_
> > best, Eliot
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20160330/7ba1a832/attachment-0001.htm


More information about the Vm-dev mailing list