[Vm-dev] Primitive replaceFrom:to:with:startingAt: in the JIT

Clément Bera bera.clement at gmail.com
Tue Jan 23 16:30:48 UTC 2018


Hi Levente,

VMMaker.oscog-cb.2323 tries to solve the slow-down problem you mentioned
here by decreasing the number of instructions in the copying loops from 7
to 5.

On my machine, copying time at 10k elements seems to have improved by a
factor of 1.25.

The copying loops become the following (replReg is adjusted by a diff ahead
of the loop, so startReg indexes both the source and the destination):

(pointer):
instr := cogit MoveXwr: startReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit CmpR: startReg R: stopReg.
cogit JumpAboveOrEqual: instr.

(bytes):
instr := cogit MoveXbr: startReg R: replReg R: TempReg.
cogit MoveR: TempReg Xbr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit CmpR: startReg R: stopReg.
cogit JumpAboveOrEqual: instr.

Since replReg, unlike arrayReg, is unused afterwards, we could cheat further
and free one register by using replReg as the counter and the fixed-index
read instruction with index 0, but I am too lazy to do it. I made this change
because incrementing 2 counters felt a bit cumbersome and I wanted to try out
the tight loop scheme they use in V8. That scheme is what mostly pays off (it
removes an unconditional jump).
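
For readers less familiar with Cogit's abstract instructions, the shape of
both loops can be modelled in plain Smalltalk. This is only a sketch under
assumptions: the variable names and sample values are made up, plain Arrays
with 1-based indices stand in for registers and memory, and the copy is
assumed to be non-empty.

```smalltalk
"Illustrative model only, not VM code. All names and values are made up."
| array repl start stop replStart diff index |
array := Array new: 8 withAll: 0.
repl := #(11 22 33 44 55 66 77 88).
start := 3.        "first destination index (plays the role of startReg)"
stop := 6.         "last destination index (plays the role of stopReg)"
replStart := 1.    "first source index"

"Done once ahead of the loop: fold the source offset into a single delta,
 so the loop only has to increment one counter."
diff := replStart - start.

index := start.
"Bottom-tested loop: the test at the end doubles as the back jump, so no
 unconditional jump is needed. Assumes start <= stop."
[array at: index put: (repl at: index + diff).   "read, then write"
 index := index + 1.                             "increment the one counter"
 index <= stop] whileTrue.                       "compare and jump back"

array   "print it: #(0 0 11 22 33 44 0 0)"
```

A top-tested loop would need the same body plus an unconditional jump back to
the test; putting the test at the bottom is what removes that jump.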

I guess there is still some slow-down when copying large byte objects. If
someone could compute the threshold above which the C code is faster, and by
what factor, tell me; I may change the jitted code so it falls back to the C
primitive over this threshold.
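
One possible way to look for that threshold is to run the same script on a VM
with the jitted primitive and on one without it, then compare the outputs
size by size. The sketch below is an assumption-laden starting point, not a
validated benchmark: the sizes, the iteration budget and the variable names
are arbitrary choices.

```smalltalk
"Sketch only: sizes and iteration budget are arbitrary, not recommendations."
| source target sizes results |
source := ByteArray new: 100000.
target := ByteArray new: 100000.
sizes := #(32 64 128 256 512 1024 4096 16384 65536).
results := sizes collect: [ :size |
	| iterations overhead time |
	"Copy roughly the same total number of bytes for every size."
	iterations := 1000000000 // size.
	overhead := Time millisecondsToRun: [ 1 to: iterations do: [ :i | ] ].
	time := Time millisecondsToRun: [
		1 to: iterations do: [ :i |
			target replaceFrom: 1 to: size with: source startingAt: 1 ] ].
	"Nanoseconds per copied byte, with the empty-loop overhead subtracted."
	size -> ((time - overhead) * 1000000 / (iterations * size)) asFloat ].
results
```

Dividing the per-byte figures of the two VMs for each size gives the
performance factor; the smallest size where the older VM wins would be the
threshold.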

Hopefully next month we'll discuss the #compareTo:[collated:] primitive for
data objects (optional parameters in [ ], the default being ASCII order)...
There is much to discuss, since we can use pointer comparison for byte
objects, even for the last word since unused bytes are zeroed, provided of
course that we deal with some narrow cases...
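
To make the idea concrete, here is a plain Smalltalk model of comparing byte
objects one word at a time. It is only a sketch and not the planned
primitive: the 4-byte word size, the big-endian grouping (chosen so that word
order matches byte order), the block names and the size tie-break are all
assumptions made for illustration.

```smalltalk
"Illustrative model only, not VM code. Missing trailing bytes read as 0,
 standing in for the zeroed unused bytes of the last word."
| wordAt compare |
wordAt := [ :bytes :wordIndex |
	| word |
	word := 0.
	(wordIndex - 1) * 4 + 1 to: wordIndex * 4 do: [ :i |
		word := (word bitShift: 8)
			+ (i <= bytes size ifTrue: [ bytes at: i ] ifFalse: [ 0 ]) ].
	word ].
compare := [ :a :b |
	| result wordCount |
	result := 0.
	wordCount := ((a size max: b size) + 3) // 4.
	1 to: wordCount do: [ :w |
		| wa wb |
		result = 0 ifTrue: [
			wa := wordAt value: a value: w.
			wb := wordAt value: b value: w.
			wa = wb ifFalse: [
				result := wa < wb ifTrue: [ -1 ] ifFalse: [ 1 ] ] ] ].
	"One narrow case: zero padding cannot tell #[1 2] from #[1 2 0],
	 so the sizes break the tie."
	result = 0 ifTrue: [ result := (a size - b size) sign ].
	result ].

compare value: #[1 2 3 4 5] value: #[1 2 3 4 6]   "-1"
```

Evaluating `compare value: #[1 2] value: #[1 2 0]` also answers -1, but only
because of the size tie-break, since both objects produce the same single
word.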

Best,

On Mon, Dec 25, 2017 at 10:57 PM, Levente Uzonyi <leves at caesar.elte.hu>
wrote:

> Hi Clément,
>
> I finally found the time to write some benchmarks.
> I compared the output of the script below on sqcogspur64linuxht vm
> 201710061559 and 201712221331 available on bintray.
>
> result := { ByteArray. DoubleByteArray. WordArray. DoubleWordArray.
> ByteString. WideString. FloatArray. Array } collect: [ :class |
>         | collection |
>         Smalltalk garbageCollect.
>         collection := class basicNew: 10000.
>         class -> (#(0 1 2 5 10 20 50 100 200 500 1000 2000 5000 10000)
> collect: [ :size |
>                 | iterations time overhead |
>                 iterations := (40000000 // (size max: 1) sqrt) floor.
>                 overhead := [ 1 to: iterations do: [ :i | ] ] timeToRun.
>                 time := [ 1 to: iterations do: [ :i |
>                         collection replaceFrom: 1 to: size with:
> collection startingAt: 1 ] ] timeToRun.
>                 { size. iterations. time - overhead } ]) ].
>
> I found that the quick paths are probably only implemented for byte and
> pointer collections, because there was no significant difference for
> DoubleByteArray, WordArray, DoubleWordArray, WideString and FloatArray.
>
> For pointer and byte collections, there's a significant speedup when the
> copied portion is small. However, somewhere between 50 and 100 copied
> elements, the copying of byte collections becomes slower (up to 1.5x @
> 100k elements) with the newer VM.
> It's interesting that this doesn't happen to pointer classes. Instead of
> a slowdown there's still a 1.5x speedup even at 100k elements.
>
> Levente
>
>
> On Mon, 23 Oct 2017, Clément Bera wrote:
>
>> Hi all,
>> For a long time I have wanted to add the primitive
>> #replaceFrom:to:with:startingAt: to the JIT but did not take the time to do
>> it. These days I am showing the JIT to one of my students, and as an example
>> of how one would write code in the JIT we implemented this primitive
>> together, Spur-only. This is part of commit 2273.
>>
>> I implemented quick paths for byte objects and array-like objects only.
>> The rationale behind this is that the most common cases I see in the
>> profiler for Pharo user benchmarks are copies of Arrays and ByteStrings.
>> Typically some application benchmarks would show 3-5% of time spent in
>> copying small things, and switching from the JIT runtime to the C runtime
>> is an important part of the cost.
>>
>> A first evaluation shows the following speed-ups, but I've only measured
>> this quickly on my machine:
>>
>> Copy of size 0
>>     Array 2.85x
>>     ByteString 2.7x
>> Copy of size 1
>>     Array 2.1x
>>     ByteString 2x
>> Copy of size 3
>>     Array 2x
>>     ByteString 1.9x
>> Copy of size 8
>>     Array 1.8x
>>     ByteString 1.8x
>> Copy of size 64
>>     Array 1.1x
>>     ByteString 1.1x
>> Copy of size 1000
>>     Array 1x
>>     ByteString 1x
>>
>> So I would expect some macro benchmarks to get a 1 to 3% speed-up.
>> Not as much as I expected but it's there.
>>
>> Can someone who is good at benchmarks, such as Levente, have a look and
>> provide us with a better evaluation of the performance difference?
>>
>> Thanks.
>>
>> --
>> Clément Béra
>> Pharo consortium engineer
>> https://clementbera.wordpress.com/
>> Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq
>>
>>


-- 
Clément Béra
Pharo consortium engineer
https://clementbera.wordpress.com/
Bâtiment B 40, avenue Halley 59650 Villeneuve d'Ascq