[Vm-dev] Max number of method arguments

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Wed Jan 9 00:46:13 UTC 2019


On Tue, Jan 8, 2019 at 11:31 PM Eliot Miranda <eliot.miranda at gmail.com>
wrote:

>
> Hi Nicolas,
>
> On Tue, Jan 8, 2019 at 1:51 PM Nicolas Cellier <
> nicolas.cellier.aka.nice at gmail.com> wrote:
>
>>
>> Hi all,
>> particularly Clement and Eliot,
>>
>> One of the most annoying limits of bytecode is the number of arguments
>> (<16 in V3); not so annoying for pure Smalltalk, but certainly so for
>> FFI (FORTRAN 77 lacks structures, so existing code bases often have
>> functions with many arguments).
>> For scientific Smalltalk, some of those old FORTRAN libraries are still
>> around nowadays (LAPACK is an example).
>>
>
> Agreed.  There are VW users out there with autogenerated code that
> requires more than 15 arguments.  Clément and I already have a design in
> mind, which is much more elegant than using the extra bit below.  However,
> it does require that we change the maximum Context stack size, which is one
> reason (the other being lack of time) why we haven't implemented this so
> far.
>
> In 2008 my closure design introduced indirection vectors for closed-over
> arguments, and among the five bytecodes added to implement it was the Create
> Array bytecode, which can do one of two things:
>
> V3PlusClosures:
> 138   10001010 jkkkkkkk Push (Array new: kkkkkkk) (j = 0)
> or Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)
>
> SistaV1:
> 231 11100111 jkkkkkkk Push (Array new: kkkkkkk) (j = 0)
> & Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)
>
> This bytecode is used to create indirection vectors, and to create tuples
> of size <= 8.  e.g. { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8 }
> 89 <52> pushThisContext:
> 90 <81> send: method
> 91 <80> send: symbolic
> 92 <E8 02> pushConstant: 2
> 94 <E8 03> pushConstant: 3
> 96 <E8 04> pushConstant: 4
> 98 <E8 05> pushConstant: 5
> 100 <E8 06> pushConstant: 6
> 102 <E8 07> pushConstant: 7
> 104 <E8 08> pushConstant: 8
> 106 <E7 88> pop 8 into (Array new: 8)
> 108 <5C> returnTop
>
> We call this version of the bytecode the cons array bytecode.  The other
> form, used to create indirection vectors, is the create array bytecode.
>
> (c.f. { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8. 9 } which
> produces much more code but requires only 2 elements of stack depth).
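> (Roughly, such a larger brace compiles to the streaming form, i.e.
> something like
>     pushLitVar: #Array=>Array
>     pushConstant: 9
>     send: braceStream:
>     dup; <push element>; send: nextPut:; pop   "repeated per element"
>     send: braceArray
> so the stack depth no longer grows with the number of elements.)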
>
> There are also three bytecodes used to access indirection vectors:
>
> V3PlusClosures:
> 140   10001100 kkkkkkkk jjjjjjjj Push Temp At kkkkkkkk In Temp Vector At:
> jjjjjjjj
> 141   10001101 kkkkkkkk jjjjjjjj Store Temp At kkkkkkkk In Temp Vector
> At: jjjjjjjj
> 142   10001110 kkkkkkkk jjjjjjjj Pop and Store Temp At kkkkkkkk In Temp
> Vector At: jjjjjjjj
> SistaV1:
> 251 11111011 kkkkkkkk sjjjjjjj Push Temp At kkkkkkkk In Temp Vector At:
> jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector
> access
> * 252 (3) 11111100 kkkkkkkk sjjjjjjj Store Temp At kkkkkkkk In Temp
> Vector At: jjjjjjj s = 1 implies remote inst var access instead of remote
> temp vector access
> * 253 (3) 11111101 kkkkkkkk sjjjjjjj Pop and Store Temp At kkkkkkkk In
> Temp Vector At: jjjjjjj s = 1 implies remote inst var access instead of
> remote temp vector access
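> (For instance, given
>     | t | t := 0. [ t := t + 1 ] value. ^t
> t is assigned inside the block, so it is kept in a one-element indirection
> vector rather than as a plain temp, and is read and written with "Push Temp
> At 0 In Temp Vector At: 0" and "Store Temp At 0 In Temp Vector At: 0" above.)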
>
> So the insight is that if we pass arguments beyond 14 in an indirection
> vector we can have up to 15 + 127 = 142 arguments without needing any extra
> bits in a CompiledMethod header or range in a bytecode.  We simply pop
> arguments beyond the 14th into an indirection vector, using the cons array
> bytecode.  Yes, this is slow compared to "native" support, but such methods
> are extremely rare, and supporting them this way means we have less waste
> elsewhere.  It will require some sophistication in the Decompiler, but
> otherwise seems quite simple.
>
> With this design, as far as the VM is concerned the maximum argument count
> is still 15.  Only the image need bother with how to record the argument
> count for a method that has more than 15 arguments, and indeed a method with
> 15 arguments can still use all 15 arguments without having to create an
> indirection vector.  The effects are isolated to the compiler (arguments
> beyond the 14th in methods with more than 15 arguments must be accessed
> using the indirection vector bytecodes above), and are otherwise quite
> localized: indirection vector creation occurs immediately after normal
> argument marshaling and immediately before the send bytecode.
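> (A hypothetical sketch: for a send of a 17-argument selector the compiler
> would push the receiver and all 17 arguments as usual, then emit the cons
> array bytecode to pop the last 3 into an Array that becomes the 15th and
> final argument, and then emit the send with an argument count of 15; in the
> callee, the 16th declared argument would be read with "Push Temp At 1 In
> Temp Vector At: 14".)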
>
> Does this design appeal to you?  If it does, then we should discuss when
> and how it should be implemented.  One thing would be to make the maximum
> size of a Context (defined at the image level by CompiledCode's LargeFrame
> class variable, but hard-coded into the VM) some kind of VM parameter, e.g.
> stored in the image header and read at start-up.  It would be quite easy to
> add this.  If we did so, we should also ensure the stack page size
> calculation allows for a stack page big enough for one or two huge frames.
> Note that the design also means that a large stack is needed only to
> *marshal* arguments, not to activate a method with many arguments, since
> the excess arguments are stored in an indirection vector.
>
> P.S. Indeed we could use the scheme used for arbitrary-sized tuples to
> marshal extra arguments, but this would affect code generation much more.
> Different code would have to be used to marshal each argument beyond 15,
> whereas using the cons array bytecode adds only a single bytecode after
> normal argument marshaling.
>
>
Didn't we discuss that already?
It is OK for me that such calls are not super fast.
But I prefer a single Array, because I will use the invokeWithArguments: FFI
primitive.

transformFFICallwithManyArguments: aPattern
    "Transform current parseNode into an invokeWithArguments: send suitable
for the case of many arguments"

    ^BlockNode new
        temporaries: #() ; "Just to avoid later problems with BlockAnalyzer"
        arguments: #()
        statements: (OrderedCollection with: (MessageNode new
                    receiver: (BlockNode new
                            temporaries: #(); "Just to avoid later problems with BlockAnalyzer"
                            arguments: #()
                            statements: (OrderedCollection
                                    with: (MessageNode new
                                            receiver: (encoder encodeLiteral: encoder literals first)
                                            selector: #invokeWithArguments:
                                            arguments: aPattern arguments
                                            precedence: 3
                                            from: encoder))
                            returns: false
                            from: encoder)
                    selector: #ifError:
                    arguments: (Array with: (parseNode temporaries: parseNode temporaries; yourself))
                    precedence: 3
                    from: encoder) asReturnNode)
        returns: true
        from: encoder

to generate something like:

<cdecl: long 'dtgexc_' (long* long* long* double* long* double* long*
double* long* double* long* long* long* double* long* long*)>
65 <8B 78 00> callPrimitive: 120
68 <10> pushTemp: 0
69 <8F 10 00 04> closureNumCopied: 1 numArgs: 0 bytes 73 to 76
73     <23> pushConstant: <cdecl: long 'dtgexc_' (long* long* long* double*
long* double* long* double* long* double* long* long* long* double* long*
long*)>
74     <10> pushTemp: 0
75     <E2> send: invokeWithArguments:
76     <7D> blockReturn
77 <8F 00 00 03> closureNumCopied: 0 numArgs: 0 bytes 81 to 83
81     <70> self
82     <D4> send: externalCallFailed
83     <7C> returnTop
84 <E1> send: ifError:
85 <7C> returnTop

If the first 14 arguments come individually, and the excess arguments in a
separate Array, then I will have to remarshal all the arguments into a
larger Array.
Unless primitive 120 (primitiveCalloutToFFI) learns how to unmarshal by
itself?
Currently primitive 120 will fail (I could eventually remove the primitive
number, but I don't remember if some image-side code depends on it...).
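Concretely, the extra copy I mean would look roughly like this (hypothetical
names: arg1..arg14 passed individually, excessArray the indirection vector,
externalFunction the ExternalLibraryFunction literal):

    allArgs := {arg1. arg2. arg3. arg4. arg5. arg6. arg7. arg8.
        arg9. arg10. arg11. arg12. arg13. arg14} , excessArray.
    ^externalFunction invokeWithArguments: allArgs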

Note that I have a kind of SmalltalkPattern that will answer the already-packed
list of arguments if the number of arguments is > 15 (temp 0 above):
SmalltalkPatternForCodeGeneration>>arguments
    ^arrayArgumentVariable
            ifNil: [argumentNodes asArray]
            ifNotNil: [Array with: arrayArgumentVariable].

SmalltalkPattern>>bindArg: name encoder: encoder
    "Accumulate another argument.
    Check byte code limit and work around"
    | node |
    argumentNames size = 15
        ifTrue: [arrayArgumentVariable := encoder resetArgumentsAsArray: #singleArrayOfArguments.
            argumentNodes := encoder
                        rebindArguments: argumentNames
                        asArray: arrayArgumentVariable
                        pattern: self].
    argumentNames add: name.
    arrayArgumentVariable isNil
        ifTrue: [node := encoder bindTemp: name]
        ifFalse: [node := self buildAccessorForArgumentRank: argumentNames size encoder: encoder.
            encoder bind: name to: node].
    node nowHasDef nowHasRef.
    argumentNodes add: node.
    ^ node

For marshalling, I already use the BraceNode, which we have not yet bothered
to optimize...
SmallapackMessageNode>>sizeCodeForValue: encoder
    | total argSize |
    arguments size > 15 ifFalse: [^super sizeCodeForValue: encoder].
    receiver == NodeSuper
        ifTrue: [selector := selector copy "only necess for splOops"].
    total := selector
                sizeCode: encoder
                args: 1
                super: receiver == NodeSuper.
    receiver == nil
        ifFalse: [total := total + (receiver sizeCodeForValue: encoder)].
    "re-use equalNode for creating the array argument... not a very clean
hack"
    equalNode := BraceNode new elements: arguments.
    argSize := equalNode sizeCodeForValue: encoder.
    total := total + argSize.
    sizes := Array with: argSize.
    ^ total

97 <70> self
98 <75> pushConstant: 0
99 <E0> send: cIntegerPointerOn:
100 <82 4F> popIntoTemp: 15
102 <70> self
103 <43> pushLitVar: #Array=>Array
104 <24> pushConstant: 16
105 <E2> send: braceStream:
106 <88> dup
107 <70> self
108 <10> pushTemp: 0
109 <E5> send: cLogicalPointerOn:
110 <C4> send: nextPut:
111 <87> pop
112 <88> dup
113 <70> self
114 <11> pushTemp: 1
115 <E5> send: cLogicalPointerOn:
116 <C4> send: nextPut:
117 <87> pop
118 <88> dup
119 <70> self
120 <12> pushTemp: 2
121 <E0> send: cIntegerPointerOn:
122 <C4> send: nextPut:
123 <87> pop
124 <88> dup
125 <13> pushTemp: 3
126 <C4> send: nextPut:
127 <87> pop
128 <88> dup
129 <70> self
130 <14> pushTemp: 4
131 <E0> send: cIntegerPointerOn:
132 <C4> send: nextPut:
133 <87> pop
134 <88> dup
135 <15> pushTemp: 5
136 <C4> send: nextPut:
137 <87> pop
138 <88> dup
139 <70> self
140 <16> pushTemp: 6
141 <E0> send: cIntegerPointerOn:
142 <C4> send: nextPut:
143 <87> pop
144 <88> dup
145 <17> pushTemp: 7
146 <C4> send: nextPut:
147 <87> pop
148 <88> dup
149 <70> self
150 <18> pushTemp: 8
151 <E0> send: cIntegerPointerOn:
152 <C4> send: nextPut:
153 <87> pop
154 <88> dup
155 <19> pushTemp: 9
156 <C4> send: nextPut:
157 <87> pop
158 <88> dup
159 <70> self
160 <1A> pushTemp: 10
161 <E0> send: cIntegerPointerOn:
162 <C4> send: nextPut:
163 <87> pop
164 <88> dup
165 <1B> pushTemp: 11
166 <C4> send: nextPut:
167 <87> pop
168 <88> dup
169 <1C> pushTemp: 12
170 <C4> send: nextPut:
171 <87> pop
172 <88> dup
173 <1D> pushTemp: 13
174 <C4> send: nextPut:
175 <87> pop
176 <88> dup
177 <70> self
178 <1E> pushTemp: 14
179 <E0> send: cIntegerPointerOn:
180 <C4> send: nextPut:
181 <87> pop
182 <88> dup
183 <1F> pushTemp: 15
184 <C4> send: nextPut:
185 <87> pop
186 <D6> send: braceArray
187 <E1> send:
#xtgexcWithwantq:wantz:n:a:lda:b:ldb:q:ldq:z:ldz:ifst:ilst:work:lwork:info:
(1 arg)
188 <87> pop
189 <1F> pushTemp: 15
190 <D8> send: getHandle
191 <76> pushConstant: 1
192 <E7> send: signedLongAt:
193 <7C> returnTop

For unmarshalling, I did not bother too much to optimize the bytecode either,
because most use cases are just wrappers to FFI:
SmalltalkPatternForCodeGeneration>>buildAccessorForArgumentRank: index encoder: encoder
    | indexNode |
    indexNode := encoder encodeLiteral: index.
    ^ MessageForTooManyArgNode new
        receiver: arrayArgumentVariable
        selector: #at:
        arguments: (Array with: indexNode)
        precedence: 3
        from: encoder
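So in the method body a reference to, say, the 16th declared argument simply
compiles to a send, roughly:

    singleArrayOfArguments at: 16

which is cheap enough given that these methods are mostly thin FFI wrappers.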

Note that the quick hacks that I already have could be transformed to the
form you suggest (14 args + excess_array) without too much effort, if you
think that primitiveCalloutToFFI (120) can be rewritten to handle
14+excess...

>> I patched the old Squeak compiler in Smallapack to work around this
>> limitation (it was easy enough to pass a single Array, and invoke FFI with
>> many args).
>> In the modern Pharo flavour, this is more involved with the new OpalCompiler
>> (it does not seem to be designed for extensibility, as it seems necessary
>> to patch many pieces/subclasses for a single feature change...).
>>
>> But we now have the Sista V1 bytecodes, which removed a lot of limitations
>> (number of inst vars, number of literals, max jump offset...). Alas, I don't
>> see a changed limit for the number of arguments (source:
>> https://hal.inria.fr/hal-01088801/document, "a bytecode set for adaptive
>> optimization"): there is still a limit of 4 reserved bits in the compiled
>> method header documented in the link above.
>> Though, there is an adjacent unused bit now...
>> In Squeak/Pharo, EncoderForSistaV1>>genSend:numArgs: suggests that the
>> limit is 31 (sic):
>>
>>     (nArgs < 0 or: [nArgs > 31]) ifTrue:
>>         [^self outOfRangeError: 'numArgs' index: nArgs range: 0 to: 31 "!!"].
>>
>> or at least 2047 (jjj at most 7, plus 8 * Extend B at most 8 * 255) if we
>> believe the code below:
>>
>>     "234        11101010    i i i i i j j j    Send Literal Selector #iiiii (+ Extend A * 32) with jjj (+ Extend B * 8) Arguments"
>>
>>
>> https://github.com/pharo-project/pharo/blob/50992c3e5fed790b7e660954aee983f4681da658/src/Kernel-BytecodeEncoders/EncoderForSistaV1.class.st
>>
>> Pharo also limits numArgs to 15, whatever the encoding, in
>> CompiledMethod>>newBytes:trailerBytes:nArgs:nTemps:nStack:nLits:primitive:
>>
>> https://github.com/pharo-project/pharo/blob/50992c3e5fed790b7e660954aee983f4681da658/src/Kernel/CompiledMethod.class.st
>>
>> But Squeak does not limit nArgs at all in
>> EncoderForSistaV1>>computeMethodHeaderForNumArgs:numTemps:numLits:primitive:.
>>
>> So my questions:
>> - is that doc up-to-date?
>> - if so, couldn't we expand the limit to 31 args by using the unused bit?
>>
>> Note: there is another unused bit in V3 (not adjacent), and the double
>> extended (send) bytecode has room for 31 args in V3 too, since only the
>> first 3 bits of the second byte encode the type of operation...
>>
>
>
> --
> _,,,^..^,,,_
> best, Eliot
>