[Vm-dev] Primitive replaceFrom:to:with:startingAt: in the JIT

Clément Bera bera.clement at gmail.com
Wed Dec 27 20:14:02 UTC 2017


> Hi,
>
> Indeed, I implemented the quick paths only for byte objects and pointer
> objects. The reason for this is that I based the implementation on customer
> benchmarks, and there was a lot of replacements for pointer objects
> (typically, growing the inner array of OrderedCollection/Dictionaries)
> and byte objects (typically bytestrings concatenation, such as 'a' , 'b'
> which calls twice this primitive for only a couple characters and is now
> much faster).
>
> I wrote the quick path with short copies mind, in which case switching to
> the C runtime is too expensive. Based on your analysis, for pointer objects
> it looks good, but for byte objects it gets slower.
>
> So I see two solutions for the slow down on byte copies:
> 1) over a threshold of 1000 elements, fall back to the slang primitive for
> byte objects
> 2) Generate smarter copying machine code (for example with some kind of
> duff device / unrolled loop / overlapping copies with 1 copy non aligned).
> If possible nothing requiring to extend the RTL. Note that memcopy in C is
> written as a per back-end files which is thousands of lines on x86 and we
> don't want to go anywhere near this complexity, in our case the size of
> generate machine code also matters and we want to keep it small. We could
> use pointer size read / writes on byte objects too if we deal with aliasing
> (if array == replArray just fall back to a slower path).
>
> Let's look at the naive machine code in Cog's RTL (I could show the actual
> assembly version if you like it better), in each case the loop does:
> - compare and jumpBelow to see if the copy is finished
> - read one field from repl
> - write the read field in the array
> - increment the 2 registers holding indexes for read/write
> - jump back.
>
> The copying machine code for byte object is as follow:
>
>                        instr := cogit CmpR: startReg R: stopReg.
> jumpFinished := cogit JumpBelow: 0.
> cogit MoveXbr: repStartReg R: replReg R: TempReg.
> cogit MoveR: TempReg Xbr: startReg R: arrayReg.
> cogit AddCq: 1 R: startReg.
> cogit AddCq: 1 R: repStartReg.
> cogit Jump: instr.
>
> The copying code for pointer object is as follow:
> instr := cogit CmpR: startReg R: stopReg.
> jumpFinished := cogit JumpBelow: 0.
> cogit MoveXwr: repStartReg R: replReg R: TempReg.
> cogit MoveR: TempReg Xwr: startReg R: arrayReg.
> cogit AddCq: 1 R: startReg.
> cogit AddCq: 1 R: repStartReg.
> cogit Jump: instr.
>
> Any idea what we could do smarter, but not too complex, without extending
> the RTL and without generating let's say more than 20 extra instructions ?
>
> I think checking for aliasing and using pointer size read/write may be the
> way to go. Then if performance is not better we just fall back to slang
> code.
>
> On Mon, Dec 25, 2017 at 10:57 PM, Levente Uzonyi <leves at caesar.elte.hu>
> wrote:
>
>> Hi Clémont,
>>
>> I finally found the time to write some benchmarks.
>> I compared the output of the script below on sqcogspur64linuxht vm
>> 201710061559 and 201712221331 avaiable on bintray.
>>
>> result := { ByteArray. DoubleByteArray. WordArray. DoubleWordArray.
>> ByteString. WideString. FloatArray. Array } collect: [ :class |
>>         | collection |
>>         Smalltalk garbageCollect.
>>         collection := class basicNew: 10000.
>>         class -> (#(0 1 2 5 10 20 50 100 200 500 1000 2000 5000 10000)
>> collect: [ :size |
>>                 | iterations time overhead |
>>                 iterations := (40000000 // (size max: 1) sqrt) floor.
>>                 overhead := [ 1 to: iterations do: [ :i | ] ] timeToRun.
>>                 time := [ 1 to: iterations do: [ :i |
>>                         collection replaceFrom: 1 to: size with:
>> collection startingAt: 1 ] ] timeToRun.
>>                 { size. iterations. time - overhead } ]) ].
>>
>> I found that the quick paths are probably only implented for bytes and
>> pointers collections, because there was no significant difference for
>> DoubleByteArray, WordArray, DoubleWordArray, WideString and FloatArray.
>
>
>> For pointers and bytes collections, there's significant speedup when the
>> copied portion is small. However, somewhere between 50 and 100 copied
>> elements, the copying of bytes collections becomes slower (up to 1.5x @
>> 100k elements) with the newer VM.
>> It's interesting that this doesn't happen to pointers classes. Instead of
>> slowdown there's still 1.5x speedup even at 100k elements.
>>
>> Levente
>>
>>
>> On Mon, 23 Oct 2017, Clément Bera wrote:
>>
>> Hi all,
>>> For a long time I was willing to add primitive
>>> #replaceFrom:to:with:startingAt: in the JIT but did not take time to do
>>> it. These days I am showing the JIT to one of my students and as an example
>>> of how one would write code in the JIT we implemented this primitive
>>> together, Spur-only. This is part of commit 2273.
>>>
>>> I implemented quick paths for byte objects and array-like objects only.
>>> The rationale behind this is that the most common cases I see in Pharo user
>>> benchmarks in the profiler is copy of arrays and byteStrings. Typically
>>> some application benchmarks would show 3-5% of
>>> time spent in copying small things, and switching from the JIT runtime
>>> to C runtime is an important part of the cost.
>>>
>>> First evaluation shows the following speed-ups, but I've just done that
>>> quickly in my machine:
>>>
>>> Copy of size 0
>>>     Array 2.85x
>>>     ByteString 2.7x
>>> Copy of size 1
>>>     Array 2.1x
>>>     ByteString 2x
>>> Copy of size 3
>>>     Array 2x
>>>     ByteString 1.9x
>>> Copy of size 8
>>>     Array 1.8x
>>>     ByteString 1.8x
>>> Copy of size 64
>>>    Array 1.1x
>>>    ByteString 1.1x
>>> Copy of size 1000
>>>    Array 1x
>>>    ByteString 1x
>>>
>>> So I would expect some macro benchmarks to get 1 to 3% percent speed-up.
>>> Not as much as I expected but it's there.
>>>
>>> Can someone who is good at benchmarks such as Levente have a look and
>>> provide us with a better evaluation of the performance difference ?
>>>
>>> Thanks.
>>>
>>> --
>>> Clément Béra
>>> https://clementbera.wordpress.com/
>>> Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
>>>
>>>
>
>
> --
> Clément Béra
> https://clementbera.wordpress.com/
> Bâtiment B 40, avenue Halley 59650 *Villeneuve d'Ascq*
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.squeakfoundation.org/pipermail/vm-dev/attachments/20171227/cbb053ab/attachment-0001.html>


More information about the Vm-dev mailing list