Exupery for sub-pixel font filtering.

bryce at kampjes.demon.co.uk bryce at kampjes.demon.co.uk
Mon Nov 13 22:05:20 UTC 2006


Hello again,
This time about sub-pixel aliasing.

Andrew Tween writes:
 > Hi Bryce,
 > I think it is a good idea to release the solid 3.8 version.
 > 
 > Having said that, I am looking forward to the 3.9 release because I really want
 > to try using Exupery on my sub-pixel font filtering algorithm to see if it can
 > speed it up. Currently this is in 3.9, and I don't want to port it all back to
 > an earlier image/vm, especially since you are moving forward to 3.9.

Exupery runs fine on 3.9, the tests just needed to be fixed.

The best way to find out how it performs for your example would be to
load Exupery into your 3.9 image and try it.

 > This is probably a topic for another thread, but could you tell from looking at
 > the attached method if it is a good candidate for speed-up. It has nested loops,
 > does lots of at: and integerAt:Put: (prim 166) , and SmallInteger bitShift: ,
 > bitAnd: , *, + , // , and some Float calcs.

I'm not sure how well it would run. The code is definitely a promising
candidate to compile; however, Exupery doesn't yet compile Floats, large
integers, or primitive 166. I don't think the interpreter does any
special optimisations for them either, so chances are those operations
will run at the same speed. Exupery will be able to optimise the
SmallInteger calculations and the looping overhead.

The method could definitely be optimised much more. Adding
integerAt:put: and ByteArray>>at: primitives would help. So would
basic floating point optimisations. Going further, adding support for
machine word (32 bit integer) and byte objects should allow us to
compile to near C speeds.

The optimisations for machine words, byte objects, and floating point
are all very similar. The game is to remove all the intermediate
objects so the calculations are done directly in registers, without
any boxing and unboxing overhead.

  luminance := (0.299*balR)+(0.587*balG)+(0.114*balB).
  balR := balR + ((luminance - balR)*correctionFactor).
  balG := balG + ((luminance - balG)*correctionFactor).
  balB := balB + ((luminance - balB)*correctionFactor).
  balR := balR truncated.
  balR < 0 ifTrue:[balR := 0] ifFalse:[balR > 255 ifTrue:[balR := 255]].
  balG := balG truncated.
  balG < 0 ifTrue:[balG := 0] ifFalse:[balG > 255 ifTrue:[balG := 255]].
  balB := balB truncated.
  balB < 0 ifTrue:[balB := 0] ifFalse:[balB > 255 ifTrue:[balB := 255]].
  a := balR + balG + balB > 0 ifTrue:[16rFF] ifFalse:[0].
  colorVal := balB + (balG bitShift: 8) + (balR bitShift: 16) + (a bitShift: 24).
  answer bits integerAt: (y*answer width)+(x//3+1) put: colorVal.

This is a nice example of what dynamically inlined primitives could
do. The major overhead with floats is allocating memory (1). In this
example, using the current optimisation engine it should be possible
to create only 4 floats rather than the 19 needed by the interpreter.
One more allocation will be needed to form colorVal if it overflows
into a LargeInteger. SSA should allow all the floating point
intermediate values to be removed, by allowing program analysis over
more than one statement.
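To make the target concrete, here is a rough sketch in C of what the
fully unboxed version of the per-pixel computation would look like:
every intermediate stays in a machine register and nothing is
allocated per pixel. The function name and signature are my own
illustration, not part of Exupery or the original method.

```c
#include <stdint.h>

/* Hypothetical unboxed form of the per-pixel work: the floats live in
 * registers, the channels are clamped to 0..255 as in the Smalltalk
 * code, and the result is packed into one 32-bit ARGB word. */
static uint32_t balance_pixel(double balR, double balG, double balB,
                              double correctionFactor)
{
    double luminance = 0.299 * balR + 0.587 * balG + 0.114 * balB;
    balR += (luminance - balR) * correctionFactor;
    balG += (luminance - balG) * correctionFactor;
    balB += (luminance - balB) * correctionFactor;

    /* clamp to 0..255 then truncate toward zero, matching the
     * Smalltalk #truncated plus range checks */
    if (balR < 0.0) balR = 0.0; else if (balR > 255.0) balR = 255.0;
    if (balG < 0.0) balG = 0.0; else if (balG > 255.0) balG = 255.0;
    if (balB < 0.0) balB = 0.0; else if (balB > 255.0) balB = 255.0;
    int r = (int)balR, g = (int)balG, b = (int)balB;

    uint32_t a = (r + g + b > 0) ? 0xFFu : 0u;
    return (uint32_t)b | ((uint32_t)g << 8)
         | ((uint32_t)r << 16) | (a << 24);
}
```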

  balR := balR truncated.
  balR < 0 ifTrue:[balR := 0] ifFalse:[balR > 255 ifTrue:[balR := 255]].

These lines should probably be handled via a primitive that truncates
a floating point value down to an unsigned 8 bit value. For this
example alone such a primitive may be overkill, though converting
floating point values to clamped bytes is a common operation in
graphics code. With Exupery 3.0 and SSA it would be really nice to be
able to optimise down to vector instructions. With vector optimisation
we will have a level playing field with C: C compilers will need at
least as much compiler machinery as we will, and they will probably be
written in C, requiring much more work than writing in Smalltalk.
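As a sketch, such a primitive would compute something like the C
helper below. The name truncate_to_byte is mine; no such primitive
exists in Exupery yet.

```c
#include <stdint.h>

/* Hypothetical primitive: truncate a floating point value to an
 * unsigned 8 bit value, clamping to 0..255.  Behaves the same as the
 * Smalltalk #truncated followed by the two range checks above. */
static uint8_t truncate_to_byte(double v)
{
    if (v < 0.0)   return 0;    /* below range: clamp to 0 */
    if (v > 255.0) return 255;  /* above range: clamp to 255 */
    return (uint8_t)v;          /* in range: truncate toward zero */
}
```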

In summary, I think there may be some speed improvement now. Adding
the array access primitives will help. Floating point is likely to be
the next biggest win. Without SSA I doubt that other optimisations
will provide enough gain to be worthwhile. With SSA and a few extra
object types it should be possible to fully optimise it.

Bryce

(1) After upgrading the VM I'm going to implement fast compiled
primitives for #new and #@. This is driven by the largeExplorers
benchmark. #@ is inlined into the main interpret loop in the
interpreter but Exupery executes it as a normal primitive. This means
that compiling largeExplorers can lead to an 8% speed loss.

