back to memcpy eh?

Tim Rowledge tim at sumeru.stanford.edu
Sat Dec 15 18:36:29 UTC 2001


"Noel J. Bergman" wrote:
> 
> > In fact I did one even wilder; when ARMs didn't have instruction caches
> > it was trivial to do self compiling code.
> 
> > Same idea made ARM bitblt go like a rat up a drainpipe even without fast
> > memory, caches, pipelining, all this modern crap.
> 
> Tim,
> 
> Would this be helpful on all of the ARM based PDAs?  You've told me on
> multiple occasions that one of the performance issues is the lack of a large
> cache on the iPAQ.  Would any of this help in the
> sqWin32Window.cpp:iPaqFrameBufferShowDisplayBits[x]() routines?  Or would
> the weight of the loop body swamp any gains from minimizing the looping
> tests? 
It's very hard to say these days; although generating on-the-fly
optimised code will usually give good performance there are a lot of
associated problems that can bite you. Consider as a very local example
the first jitter attempt for Squeak - it did the code generation
perfectly well, but something was completely screwed up in the caching
of the resultant code. As a result it was dismally slow overall,
spending a large proportion of its time retranslating.
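
To make the retranslation problem concrete, here is a minimal sketch of a translation cache, not the Squeak jitter's actual design. Every name in it (copy_fn, code_cache, translate, lookup) is hypothetical, and translate just returns a plain memcpy wrapper as a stand-in for a real code generator. The point is only that the generator should run on a miss and never on a hit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature for a routine specialised to one row width. */
typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t width);

#define CACHE_SLOTS 16

typedef struct {
    size_t  width;  /* key the routine was generated for */
    copy_fn fn;     /* NULL means the slot is empty */
} cache_entry;

typedef struct {
    cache_entry   slots[CACHE_SLOTS];
    unsigned long hits;          /* lookups served from the cache */
    unsigned long translations;  /* times the generator had to run */
} code_cache;

/* Stand-in for generated code: a real translator would emit machine code. */
void generic_copy(uint8_t *dst, const uint8_t *src, size_t width)
{
    memcpy(dst, src, width);
}

/* Stand-in for the expensive translation step (assumption, not real codegen). */
copy_fn translate(size_t width)
{
    (void)width;
    return generic_copy;
}

/* Direct-mapped lookup: reuse a cached routine, translate only on a miss. */
copy_fn lookup(code_cache *c, size_t width)
{
    assert(c != NULL);
    cache_entry *e = &c->slots[width % CACHE_SLOTS];
    if (e->fn != NULL && e->width == width) {
        c->hits++;
        return e->fn;
    }
    /* Miss: overwrite whatever was in the slot and pay for translation. */
    e->width = width;
    e->fn = translate(width);
    c->translations++;
    return e->fn;
}
```

Starting from `code_cache c = {0};`, repeated calls to `lookup(&c, 32)` translate once and hit thereafter. Widths 32 and 48 land in the same slot of this direct-mapped table, so alternating between them evicts and retranslates every time, which is the kind of behaviour that makes a translator dismally slow overall.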

On the current ARMs, there is rather poor control of the caches and it
is really expensive to do a flush-and-recover (the actual flush costs, I
think, a single instruction) since you can't flush anything but the
whole cache. One approach to tackling that is the mini-cache in the
SA1110 series, which is a sort of side cache intended to help with
moving large blocks. It doesn't 'pollute' the main caches.

Now, this doesn't mean it is entirely a bad idea to generate code. RISC
OS actually appears to generate code to suit the transformations needed to
display images depending on the screen setup in use. Of course, that
doesn't change too often so it is something that can be done during boot
or screen-mode changes. Trying to do smaller grain dynamic code (to suit
the particular width or alignment of a bitmap for example) gets you into
those worries about the cost of generation & cache flushing vs the
benefit for this particular display call. There is a lot of stuff in the
VW code generator to try to handle all this.

> I'm rather suspecting the latter, just by eyeball-o-metric analysis,
> but you're the ARM performance guru.
Oh, I wish. I'm well out of date for most of the current cpus :-(


There is, of course, a reasonable chance that the bitmap display
routines in WinCE already use some of these tricks. M$ might be an awful
business, but they employ a lot of smart people (except in the IE &
LookOut departments, apparently).

tim






More information about the Squeak-dev mailing list