Hello again. This time about sub-pixel aliasing.
Andrew Tween writes:
Hi Bryce, I think it is a good idea to release the solid 3.8 version.
Having said that, I am looking forward to the 3.9 release because I really want to try using Exupery on my sub-pixel font filtering algorithm to see if it can speed it up. Currently this is in 3.9, and I don't want to port it all back to an earlier image/vm, especially since you are moving forward to 3.9.
Exupery runs fine on 3.9, the tests just needed to be fixed.
The best way to find out how it performs for your example would be to load Exupery into your 3.9 image and try it.
This is probably a topic for another thread, but could you tell from looking at the attached method whether it is a good candidate for speed-up? It has nested loops, does lots of at: and integerAt:put: (prim 166), and SmallInteger bitShift:, bitAnd:, *, +, //, and some Float calcs.
I'm not sure how well it would run. The code is definitely a promising candidate to compile; however, Exupery doesn't yet compile Floats, large integers, or primitive 166. I don't think the interpreter does any special optimisations for them either, so chances are those operations will run at the same speed. Exupery will be able to optimise the SmallInteger calculations and the looping overhead.
The method could definitely be optimised much more. Adding integerAt:put: and ByteArray>>at: primitives would help. So would basic floating point optimisations. Going further, adding support for machine word (32 bit integer) and byte objects should allow us to compile to near C speeds.
The optimisations for machine words, byte objects, and floating point are all very similar. The game is to remove all the intermediate objects so the calculations are done directly in registers, without any conversion and deconversion overhead.
  luminance := (0.299*balR)+(0.587*balG)+(0.114*balB).
  balR := balR + ((luminance - balR)*correctionFactor).
  balG := balG + ((luminance - balG)*correctionFactor).
  balB := balB + ((luminance - balB)*correctionFactor).
  balR := balR truncated.
  balR < 0 ifTrue:[balR := 0] ifFalse:[balR > 255 ifTrue:[balR := 255]].
  balG := balG truncated.
  balG < 0 ifTrue:[balG := 0] ifFalse:[balG > 255 ifTrue:[balG := 255]].
  balB := balB truncated.
  balB < 0 ifTrue:[balB := 0] ifFalse:[balB > 255 ifTrue:[balB := 255]].
  a := balR + balG + balB > 0 ifTrue:[16rFF] ifFalse:[0].
  colorVal := balB + (balG bitShift: 8) + (balR bitShift: 16) + (a bitShift: 24).
  answer bits integerAt: (y*answer width)+(x//3+1) put: colorVal.
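For readers less familiar with Smalltalk, here is a rough Python transcription of the per-pixel arithmetic above. The function name filter_pixel is mine, not anything in the original method; the surrounding loop and the `answer bits integerAt:put:` store are omitted, and the variable names follow the Smalltalk code:

```python
def filter_pixel(balR, balG, balB, correctionFactor):
    """Rough transcription of the per-pixel arithmetic in the
    Smalltalk snippet above (filter_pixel is a hypothetical name)."""
    # Weighted luminance, using the same coefficients as the original.
    luminance = (0.299 * balR) + (0.587 * balG) + (0.114 * balB)
    # Pull each channel toward the luminance by correctionFactor.
    balR = balR + ((luminance - balR) * correctionFactor)
    balG = balG + ((luminance - balG) * correctionFactor)
    balB = balB + ((luminance - balB) * correctionFactor)
    # Truncate and clamp each channel to 0..255, as #truncated plus
    # the ifTrue:/ifFalse: range checks do in the Smalltalk version.
    channels = []
    for v in (balR, balG, balB):
        v = int(v)  # int() truncates toward zero, like #truncated
        channels.append(min(255, max(0, v)))
    balR, balG, balB = channels
    # Alpha is fully opaque unless the pixel is black.
    a = 0xFF if balR + balG + balB > 0 else 0
    # Pack the channels into a 32-bit ARGB word.
    return balB + (balG << 8) + (balR << 16) + (a << 24)
```

For example, filter_pixel(100, 150, 200, 0.5) packs to 16rFF7891AA; every floating point operation on the way there is a fresh Float allocation in the interpreter.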
This is a nice example of what dynamically inlined primitives could do. The major overhead with floats is allocating memory (1). In this example, using the current optimisation engine it should be possible to create only 4 floats rather than the 19 needed by the interpreter. One more allocation will be needed to form colorVal if it overflows into a LargeInteger. SSA should allow all the floating point intermediate values to be removed, by allowing program analysis over more than one statement.
  balR := balR truncated.
  balR < 0 ifTrue:[balR := 0] ifFalse:[balR > 255 ifTrue:[balR := 255]].
This should probably be handled via a primitive that truncates a floating point value down to an unsigned 8 bit value. For this example such a primitive may be overkill, though converting floating point values to clamped bytes is a common operation in image code. But with Exupery 3.0 and SSA it would be really nice to be able to optimise to vectors. With vector optimisation we will have a level playing field with C: they will need at least as much compiler machinery as we will, and they will probably write their compilers in C, requiring much more work than writing in Smalltalk.
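A minimal sketch of what such a truncate-and-clamp primitive would compute, again in Python for illustration (clamp_u8 is my name for it, not an existing Exupery primitive):

```python
def clamp_u8(value):
    """Truncate a float toward zero and clamp the result to an
    unsigned 8-bit range: the work of #truncated plus the two
    range checks in the snippet above, done as one primitive."""
    v = int(value)  # int() truncates toward zero, like #truncated
    if v < 0:
        return 0
    if v > 255:
        return 255
    return v
```

The point of fusing it into one primitive is that the three-statement Smalltalk pattern collapses into a single send per channel, with no intermediate boxed values.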
In summary, I think there may be some speed improvement now. Adding the array access primitives will help. Floating point is likely to be the next biggest win. Without SSA I doubt that other optimisations will provide enough gain to be worthwhile. With SSA and a few extra object types it should be possible to fully optimise it.
Bryce
(1) After upgrading the VM I'm going to implement fast compiled primitives for #new and #@. This is driven by the largeExplorers benchmark. #@ is inlined into the main interpret loop in the interpreter, but Exupery executes it as a normal primitive. This means that compiling largeExplorers can lead to an 8% speed loss.