[Vm-dev] Max number of method arguments

Nicolas Cellier nicolas.cellier.aka.nice at gmail.com
Wed Jan 9 21:26:24 UTC 2019


After the existing legacy Compiler implementation in Squeak/Pharo,
I draw the contour of an Opal Compiler variant for compiling
methods/messages with more than 15 arguments.
I'am using the single array scheme, but this could be changed easily IMO to
14+excess...

Contrarily to what I wrote in the comment, the VM does NOT crash and I can
implement methods and send messages like this

a1: a1 a2: a2 a3: a3 a4: a4 a5: a5 a6: a6 a7: a7 a8: a8 a9: a9 a10: a10
a11: a11 a12: a12 a13: a13 a14: a14 a15: a15 a16: a16 a17: a17 a18: a18
a19: a19 a20: a20
    ^(a1+a2+a3+a4+a5+a6+a7+a8+a9+a10) +
(a11+a12+a13+a14+a15+a16+a17+a18+a19+a20)

testSend
    ^(self
        a1: 1 a2: 2 a3: 3 a4: 4 a5: 5 a6: 6 a7: 7 a8: 8 a9: 9 a10: 10
        a11: 11 a12: 12 a13: 13 a14: 14 a15: 15 a16: 16 a17: 17 a18: 18
a19: 19 a20: 20) = (20*21/2)

Playground: TestManyArgs new testSend. -> true

Find it here:
http://www.squeaksource.com/Smallapack/Smallapack-OpalCompiler-nice.1.diff

It would be nice to re-integrate these tricks in Pharo8 once stable.

Le mer. 9 janv. 2019 à 01:46, Nicolas Cellier <
nicolas.cellier.aka.nice at gmail.com> a écrit :

>
>
> Le mar. 8 janv. 2019 à 23:31, Eliot Miranda <eliot.miranda at gmail.com> a
> écrit :
>
>>
>> Hi Nicolas,
>>
>> On Tue, Jan 8, 2019 at 1:51 PM Nicolas Cellier <
>> nicolas.cellier.aka.nice at gmail.com> wrote:
>>
>>>
>>> Hi all,
>>> particularly Clement and Eliot,
>>>
>>> One of the most annoying limit of bytecode is the number of arguments
>>> (<16 in V3), not so much annoying for pure Smalltalk, but certainly so for
>>> FFI (FORTRAN 77 lacks structures so existing code base often have functions
>>> with many arguments).
>>> For scientific Smalltalk, some of those old FORTRAN libraries are still
>>> around nowadays (LAPACK is an example).
>>>
>>
>> Agreed.  There are VW users out there with autogenerated code that
>> requires more than 15 arguments.  Clément and I already have a design in
>> mind, which is much more elegant than using the extra bit below.  However,
>> it does require that we change the maximum Context stack size, which is one
>> reason (the other being lack of time) why we haven't implemented this so
>> far.
>>
>> ]In 2008 my closure design introduced indirection vectors for closed over
>> arguments and among the five bytecodes added to implement it was the Create
>> Array bytes ode that can do one of two things:
>>
>> V3PlusClosures:
>> 138   10001010 jkkkkkkk Push (Array new: kkkkkkk) (j = 0)
>> or Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)
>>
>> SistaV1:
>> 231 11100111 jkkkkkkk Push (Array new: kkkkkkk) (j = 0)
>> & Pop kkkkkkk elements into: (Array new: kkkkkkk) (j = 1)
>>
>> This bytecode is used to create indirection vectors, and to create tuples
>> of size <= 8.  e.g. { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8 }
>> #('89 <52> pushThisContext:
>> 90 <81> send: method
>> 91 <80> send: symbolic
>> 92 <E8 02> pushConstant: 2
>> 94 <E8 03> pushConstant: 3
>> 96 <E8 04> pushConstant: 4
>> 98 <E8 05> pushConstant: 5
>> 100 <E8 06> pushConstant: 6
>> 102 <E8 07> pushConstant: 7
>> 104 <E8 08> pushConstant: 8
>> 106 <E7 88> pop 8 into (Array new: 8)
>> 108 <5C> returnTop
>>
>> We call this version of the bytecode the cons array bytecode.  The other
>> form, used to create in direction vectors is the greater array bytecode.
>>
>> (c.f. { thisContext method symbolic. 2. 3. 4. 5. 6. 7. 8. 9 } which
>> produces much more code but requires only 2 elements of stack depth).
>>
>> There are also three bytecodes used to access indirection vectors:
>>
>> V3PlusClosures:
>> 140   10001100 kkkkkkkk jjjjjjjj Push Temp At kkkkkkkk In Temp Vector
>> At: jjjjjjjj
>> 141   10001101 kkkkkkkk jjjjjjjj Store Temp At kkkkkkkk In Temp Vector
>> At: jjjjjjjj
>> 142   10001110 kkkkkkkk jjjjjjjj Pop and Store Temp At kkkkkkkk In Temp
>> Vector At: jjjjjjjj
>> SistaV1:
>> 251 11111011 kkkkkkkk sjjjjjjj Push Temp At kkkkkkkk In Temp Vector At:
>> jjjjjjj, s = 1 implies remote inst var access instead of remote temp vector
>> access
>> * 252 (3) 11111100 kkkkkkkk sjjjjjjj Store Temp At kkkkkkkk In Temp
>> Vector At: jjjjjjj s = 1 implies remote inst var access instead of remote
>> temp vector access
>> * 253 (3) 11111101 kkkkkkkk sjjjjjjj Pop and Store Temp At kkkkkkkk In
>> Temp Vector At: jjjjjjj s = 1 implies remote inst var access instead of
>> remote temp vector access
>>
>> So the insight is that if we pass arguments beyond 14 in an indirection
>> vector we can have up to 15 + 127 = 142 arguments without needing any extra
>> bits in a CompiledMethod header or range in a bytecode.  We simply pop
>> arguments beyond the 14th into an indirection vector, using the cons array
>> bytecode.  Yes, this is slow compared to "native" support, but such methods
>> are extremely rare, and supporting them this way means we have less waste
>> elsewhere.  It will require some sophistication in the Decompiler, but
>> otherwise seems quite simple.
>>
>> With this design, as far as the VM is concerned the maximum argument
>> count is still 15.  Only the image need bother with how to record the
>> argument count for a method that has 15 or more arguments, and indeed a
>> method with 15 arguments can still use all 15 arguments without having to
>> create an indirection vector.  This isolates the effects to the compiler
>> (arguments beyond the 14th in methods with more than 15 arguments must be
>> accessed using the indirection vector bytecodes above), but otherwise are
>> quite localized: indirection vector creation occurs immediately after
>> normal argument marshaling and immediately before the send bytecode.
>>
>> Does this design appeal to you?  If it does, then we should discuss when
>> and how it should be implemented.  One thing would be to make the maximum
>> size of a Context, defined at the image level by CompiledCode's LargeFrame
>> class variable, but hard coded into the VM, some kind of VM parameter, e.g.
>> stored in the image header and read at start-up.  It would be quite easy to
>> add this.  If we did so we should also ensure the stack page size
>> calculation allows for a stack page big enough for one or two huge frames.
>> Note that the design also means that a large stack is needed only to
>> *marshal* arguments, not to activate a method with many arguments, since
>> the excess arguments are stored in an indirection vector.
>>
>> P.S. Indeed we could use the scheme used for arbitrary sized tuples to
>> marshall extra arguments, but this would affect code generation much more.
>> Different code would have to be used to marshall each argument beyond 15;
>> whereas using the cons array bytecode
>>
>>
> Didn't we discussed that already?
> The fact that we have not super fast calls is OK for me.
> But I prefer a single Array because I will use invokeWthArguments: FFI
> primitive.
>
> transformFFICallwithManyArguments: aPattern
>     "Transform current parseNode into an invokeWithArguments: send
> suitable for the case of many arguments"
>
>     ^BlockNode new
>         temporaries: #() ; "Just to avoid later problems with
> BlockAnalyzer"
>         arguments: #()
>         statements: (OrderedCollection with: (MessageNode new
>                     receiver: (BlockNode new
>                             temporaries: #() ; "Just to avoid later
> problems with BlockAnalyzer"
>                             arguments: #()
>                             statements: (OrderedCollection
>                                     with: (MessageNode new
>                                             receiver: (encoder
> encodeLiteral: encoder literals first)
>                                             selector: #invokeWithArguments:
>                                             arguments: aPattern arguments
>                                             precedence: 3
>                                             from: encoder))
>                             returns: false
>                             from: encoder)
>                     selector: #ifError:
>                     arguments: (Array with: (parseNode temporaries:
> parseNode temporaries; yourself))
>                     precedence: 3
>                     from: encoder) asReturnNode)
>         returns: true
>         from: encoder
>
> to generate somthing like:
>
> <cdecl: long 'dtgexc_' (long* long* long* double* long* double* long*
> double* long* double* long* long* long* double* long* long*)>
> 65 <8B 78 00> callPrimitive: 120
> 68 <10> pushTemp: 0
> 69 <8F 10 00 04> closureNumCopied: 1 numArgs: 0 bytes 73 to 76
> 73     <23> pushConstant: <cdecl: long 'dtgexc_' (long* long* long*
> double* long* double* long* double* long* double* long* long* long* double*
> long* long*)>
> 74     <10> pushTemp: 0
> 75     <E2> send: invokeWithArguments:
> 76     <7D> blockReturn
> 77 <8F 00 00 03> closureNumCopied: 0 numArgs: 0 bytes 81 to 83
> 81     <70> self
> 82     <D4> send: externalCallFailed
> 83     <7C> returnTop
> 84 <E1> send: ifError:
> 85 <7C> returnTop
>
> If first 14 arguments come individually, and excess arguments into a
> separate Array, then I will have to remarshall all the arguments into a
> larger Array.
> Unless primitive 120 (primitiveCalloutToFFI) learns how to unmashall by
> itself?
> Currently primitive 120 will fail (I could eventually remove the primitive
> number but I don't remember if some image side code depends on it...)
>
> Note that I have a kind of SmalltalkPattern that will answer the already
> packed list of arguments if # >15 (temp 0 above)
> SmalltalkPatternForCodeGeneration>>arguments
>     ^arrayArgumentVariable
>             ifNil: [argumentNodes asArray]
>             ifNotNil: [Array with: arrayArgumentVariable].
>
> SmalltalkPattern>>bindArg: name encoder: encoder
>     "Accumulate another argument.
>     Check byte code limit and work around"
>     | node |
>     argumentNames size = 15
>         ifTrue: [arrayArgumentVariable := encoder resetArgumentsAsArray:
> #singleArrayOfArguments.
>             argumentNodes := encoder
>                         rebindArguments: argumentNames
>                         asArray: arrayArgumentVariable
>                         pattern: self].
>     argumentNames add: name.
>     arrayArgumentVariable isNil
>         ifTrue: [node := encoder bindTemp: name]
>         ifFalse: [node := self buildAccessorForArgumentRank: argumentNames
> size encoder: encoder.
>             encoder bind: name to: node].
>     node nowHasDef nowHasRef.
>     argumentNodes add: node.
>     ^ node
>
> For Marshalling, I already use the BraceNode, which we did not care yet to
> optimize...
> SmallapackMessageNode>>sizeCodeForValue: encoder
>     | total argSize |
>     arguments size > 15 ifFalse: [^super sizeCodeForValue: encoder].
>     receiver == NodeSuper
>         ifTrue: [selector := selector copy "only necess for splOops"].
>     total := selector
>                 sizeCode: encoder
>                 args: 1
>                 super: receiver == NodeSuper.
>     receiver == nil
>         ifFalse: [total := total + (receiver sizeCodeForValue: encoder)].
>     "re-use equalNode for creating the array argument... not a very clean
> hack"
>     equalNode := BraceNode new elements: arguments.
>     argSize := equalNode sizeCodeForValue: encoder.
>     total := total + argSize.
>     sizes := Array with: argSize.
>     ^ total
>
> 97 <70> self
> 98 <75> pushConstant: 0
> 99 <E0> send: cIntegerPointerOn:
> 100 <82 4F> popIntoTemp: 15
> 102 <70> self
> 103 <43> pushLitVar: #Array=>Array
> 104 <24> pushConstant: 16
> 105 <E2> send: braceStream:
> 106 <88> dup
> 107 <70> self
> 108 <10> pushTemp: 0
> 109 <E5> send: cLogicalPointerOn:
> 110 <C4> send: nextPut:
> 111 <87> pop
> 112 <88> dup
> 113 <70> self
> 114 <11> pushTemp: 1
> 115 <E5> send: cLogicalPointerOn:
> 116 <C4> send: nextPut:
> 117 <87> pop
> 118 <88> dup
> 119 <70> self
> 120 <12> pushTemp: 2
> 121 <E0> send: cIntegerPointerOn:
> 122 <C4> send: nextPut:
> 123 <87> pop
> 124 <88> dup
> 125 <13> pushTemp: 3
> 126 <C4> send: nextPut:
> 127 <87> pop
> 128 <88> dup
> 129 <70> self
> 130 <14> pushTemp: 4
> 131 <E0> send: cIntegerPointerOn:
> 132 <C4> send: nextPut:
> 133 <87> pop
> 134 <88> dup
> 135 <15> pushTemp: 5
> 136 <C4> send: nextPut:
> 137 <87> pop
> 138 <88> dup
> 139 <70> self
> 140 <16> pushTemp: 6
> 141 <E0> send: cIntegerPointerOn:
> 142 <C4> send: nextPut:
> 143 <87> pop
> 144 <88> dup
> 145 <17> pushTemp: 7
> 146 <C4> send: nextPut:
> 147 <87> pop
> 148 <88> dup
> 149 <70> self
> 150 <18> pushTemp: 8
> 151 <E0> send: cIntegerPointerOn:
> 152 <C4> send: nextPut:
> 153 <87> pop
> 154 <88> dup
> 155 <19> pushTemp: 9
> 156 <C4> send: nextPut:
> 157 <87> pop
> 158 <88> dup
> 159 <70> self
> 160 <1A> pushTemp: 10
> 161 <E0> send: cIntegerPointerOn:
> 162 <C4> send: nextPut:
> 163 <87> pop
> 164 <88> dup
> 165 <1B> pushTemp: 11
> 166 <C4> send: nextPut:
> 167 <87> pop
> 168 <88> dup
> 169 <1C> pushTemp: 12
> 170 <C4> send: nextPut:
> 171 <87> pop
> 172 <88> dup
> 173 <1D> pushTemp: 13
> 174 <C4> send: nextPut:
> 175 <87> pop
> 176 <88> dup
> 177 <70> self
> 178 <1E> pushTemp: 14
> 179 <E0> send: cIntegerPointerOn:
> 180 <C4> send: nextPut:
> 181 <87> pop
> 182 <88> dup
> 183 <1F> pushTemp: 15
> 184 <C4> send: nextPut:
> 185 <87> pop
> 186 <D6> send: braceArray
> 187 <E1> send:
> #xtgexcWithwantq:wantz:n:a:lda:b:ldb:q:ldq:z:ldz:ifst:ilst:work:lwork:info:
> (1 arg)
> 188 <87> pop
> 189 <1F> pushTemp: 15
> 190 <D8> send: getHandle
> 191 <76> pushConstant: 1
> 192 <E7> send: signedLongAt:
> 193 <7C> returnTop
>
> For unmarshalling, I did not care too much to optimize the bytecode
> either, because most use cases are just wrappers to FFI:
> SmalltalkPatternForCodeGeneration>>buildAccessorForArgumentRank: index
> encoder: encoder
>     | indexNode |
>     indexNode := encoder encodeLiteral: index.
>     ^ MessageForTooManyArgNode new
>         receiver: arrayArgumentVariable
>         selector: #at:
>         arguments: (Array with: indexNode)
>         precedence: 3
>         from: encoder
>
> Note that the quick hacks that I already have could be transformed to the
> form you suggest (14 args + excess_array) without too much efforts if you
> think that primitiveCalloutToFFI (120) can be re-written to handle
> 14+excess...
>
> I patched the old Squeak compiler in Smallapack to workaround this
>>> limitation (it was easy enough to pass a single Array, and invoke FFI with
>>> many args).
>>> In modern Pharo flavour, this is more involved with the new OpalCompiler
>>> (iit does not seem to be designed for extensibility as it seems necessary
>>> to patch many pieces/subclasses for a single feature change...).
>>>
>>> But we now have Sista V1 bytecodes which removed a lot of limitations (#
>>> inst vars, #literals, max jump offset ...). Alas I don't see a modified
>>> limit for number of arguments (source:
>>> https://hal.inria.fr/hal-01088801/document a bytecode set for adaptive
>>> optimization): there is still a limit of 4 reserved bits in compiled method
>>> header documented in link above.
>>> Though, there is an adjacent unused bit now...
>>> In Squeak,/Pharo, EncoderForSistaV1>>genSend:numArgs: suggests that the
>>> limit is 31 (sic)
>>>
>>>     (nArgs < 0 or: [nArgs > 31]) ifTrue:
>>>         [^self outOfRangeError: 'numArgs' index: nArgs range: 0 to: 31
>>> "!!"].
>>>
>>> or at least 2047 if we believe code below:
>>>
>>>     "234        11101010    i i i i i j j j    Send Literal Selector
>>> #iiiii (+ Extend A * 32) with jjj (+ Extend B * 8) Arguments"
>>>
>>>
>>> https://github.com/pharo-project/pharo/blob/50992c3e5fed790b7e660954aee983f4681da658/src/Kernel-BytecodeEncoders/EncoderForSistaV1.class.st
>>>
>>> Pharo also limit the numArgs to 15 whatever the encoding in
>>> CompiledMethod>>newBytes:trailerBytes:nArgs:nTemps:nStack:nLits:
>>> primitive:
>>>
>>> https://github.com/pharo-project/pharo/blob/50992c3e5fed790b7e660954aee983f4681da658/src/Kernel/CompiledMethod.class.st
>>>
>>> But Squeak does not limit nArgs at all in
>>>
>>> EncoderForSistaV1>>computeMethodHeaderForNumArgs:numTemps:numLits:primitive:
>>>
>>> So my questions:
>>> - is that doc up-to-date?
>>> - if so, couldn't we expand the limit to 31 args by using the unused bit?
>>>
>>> Note: there is another unused bit in V3 (not adjacent), and the double
>>> extended (send) byte code has room for 31 args in V3 too, since only the
>>> first 3 bits of second byte encode the type of operation...
>>>
>>
>>
>> --
>> _,,,^..^,,,_
>> best, Eliot
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20190109/02960507/attachment-0001.html>


More information about the Vm-dev mailing list