Hi Levente,
VMMaker.oscog-cb.2323 tries to solve the slow-down problem you mentioned here by decreasing the number of instructions in the copying loops from 7 to 5.
On my machine, at 10k elements, copying time seems to have improved by a factor of 1.25.
The copying loop becomes the following (replReg is adjusted by a diff ahead of the loop).
(pointer):
instr := cogit MoveXwr: startReg R: replReg R: TempReg.
cogit MoveR: TempReg Xwr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit CmpR: startReg R: stopReg.
cogit JumpAboveOrEqual: instr.
(bytes):
instr := cogit MoveXbr: startReg R: replReg R: TempReg.
cogit MoveR: TempReg Xbr: startReg R: arrayReg.
cogit AddCq: 1 R: startReg.
cogit CmpR: startReg R: stopReg.
cogit JumpAboveOrEqual: instr.
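For readers less familiar with the Cogit abstract instructions, the shape of the new loop can be sketched in C. This is only an illustration, not VM code: `copy_tight` and its parameters are hypothetical names, indices are 0-based, and `stop` is assumed to be inclusive. It shows the two ideas from the loops above: the diff between the two start indices is computed once ahead of the loop so a single counter serves both accesses, and the test sits at the bottom of the loop so there is only one (conditional) jump per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not VM code: copies repl[replStart ...] into
   array[start .. stop] (0-based, stop inclusive). */
static void copy_tight(intptr_t *array, const intptr_t *repl,
                       size_t start, size_t stop, size_t replStart)
{
    /* A bottom-tested loop always runs at least once, so the caller
       must rule out the empty range before entering it. */
    assert(start <= stop);

    /* Computed once, ahead of the loop: the distance between the two
       start indices. Folding it into the replacement side is what lets
       both accesses share one counter. */
    ptrdiff_t diff = (ptrdiff_t)replStart - (ptrdiff_t)start;

    size_t i = start;
    do {
        array[i] = repl[(ptrdiff_t)i + diff]; /* load, then store        */
        i++;                                  /* the single counter      */
    } while (i <= stop);                      /* compare + one cond jump */
}
```

Counting load, store, increment, compare and conditional jump gives the 5 instructions per iteration; a second counter increment and an unconditional jump back to a top-of-loop test would bring it to 7.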
Since replReg, unlike arrayReg, is unused afterwards, we could cheat further and free one register by using replReg as the counter and the fixed-index read instruction with index 0, but I am too lazy to do it. I wanted to make this change because incrementing 2 counters felt a bit cumbersome and I wanted to try out the tight loop scheme they use in V8. That is mostly what pays off (removing an unconditional jump).
I guess there is still some slow-down when copying large byte objects. If someone could compute the threshold at which the C code becomes faster, and the performance difference factor, tell me; I may change the jitted code so that it falls back to the C primitive above this threshold.
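The kind of measurement being asked for could look roughly like the C sketch below. It is only an analogue: a plain byte-at-a-time loop stands in for the jitted code and `memmove` stands in for the C primitive, so the numbers it produces say nothing about the VM itself. All names (`copy_byte_loop`, `time_copy`, `find_threshold`) are hypothetical, and a real measurement would be run inside the image against the actual primitive.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Stand-in for the jitted loop: one byte per iteration. The volatile
   store keeps the compiler from turning the loop into a memcpy call. */
static void copy_byte_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    volatile unsigned char *d = dst;
    for (size_t i = 0; i < n; i++) d[i] = src[i];
}

/* Stand-in for the C primitive. */
static void copy_memmove(unsigned char *dst, const unsigned char *src, size_t n)
{
    memmove(dst, src, n);
}

/* Average CPU seconds per copy of n bytes, or -1.0 on allocation failure. */
static double time_copy(copy_fn fn, size_t n, int reps)
{
    unsigned char *src = malloc(n ? n : 1);
    unsigned char *dst = malloc(n ? n : 1);
    if (!src || !dst) { free(src); free(dst); return -1.0; }
    memset(src, 0xAB, n);

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) fn(dst, src, n);
    clock_t t1 = clock();

    /* Read the result once so the copies cannot be discarded as dead. */
    volatile unsigned char sink = n ? dst[n - 1] : 0;
    (void)sink;

    free(src); free(dst);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}

/* Smallest size among lo, 2*lo, 4*lo, ... <= hi at which memmove
   measured faster than the byte loop, or 0 if none did. */
static size_t find_threshold(size_t lo, size_t hi, int reps)
{
    if (lo == 0) lo = 1;
    for (size_t n = lo; n <= hi; n *= 2) {
        double loop = time_copy(copy_byte_loop, n, reps);
        double libc = time_copy(copy_memmove, n, reps);
        if (loop < 0 || libc < 0) return 0;
        if (libc < loop) return n;
    }
    return 0;
}
```

Calling something like `find_threshold(16, 1 << 20, 1000)` several times and looking at the spread gives a rough idea of where the crossover sits and how stable it is; the ratio of the two timings at each size is the performance difference factor.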
Hopefully next month we'll discuss the #compareTo:[collated:] primitive for data objects (optional parameters are in [ ], the default being ASCII order)... There is much to discuss, since we can use pointer comparison for byte objects, even for the last word since unused bytes are zeroed, provided of course that we deal with some narrow cases...
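To make the zeroed-padding remark concrete, here is a minimal C sketch. It reads "pointer comparison" as comparing pointer-sized words, which is an interpretation, and it assumes both byte objects are allocated in whole words with the unused tail bytes set to zero. `compare_padded` and `words_for` are hypothetical names, not the proposed primitive. Words are only used to detect a difference; the ordering itself is decided on the first differing byte, so the result does not depend on endianness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD sizeof(uintptr_t)

/* Number of whole words needed to hold nbytes. */
static size_t words_for(size_t nbytes) { return (nbytes + WORD - 1) / WORD; }

/* Returns -1, 0 or 1 in unsigned byte (ASCII) order.
   ASSUMPTION: both buffers span whole words and their padding is zero. */
static int compare_padded(const unsigned char *a, size_t alen,
                          const unsigned char *b, size_t blen)
{
    size_t na = words_for(alen), nb = words_for(blen);
    size_t n = na < nb ? na : nb;

    for (size_t w = 0; w < n; w++) {
        uintptr_t wa, wb;
        memcpy(&wa, a + w * WORD, WORD);   /* alignment-safe word load */
        memcpy(&wb, b + w * WORD, WORD);
        if (wa != wb) {
            /* The words differ, so a differing byte exists in this word.
               If it falls in one side's zero padding, that side is the
               shorter one and correctly sorts first. */
            for (size_t k = w * WORD; ; k++)
                if (a[k] != b[k]) return a[k] < b[k] ? -1 : 1;
        }
    }
    /* All shared words are equal: the shorter object sorts first. */
    return (alen > blen) - (alen < blen);
}
```

The last word needs no special handling precisely because the padding is zero on both sides; without that guarantee the final partial word would have to be masked or compared byte by byte.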
Best,