Anybody got a RSS library available in Squeak, that they're willing to share?
--- Brent Vukmer bvukmer@blackboard.com wrote:
Anybody got a RSS library available in Squeak, that they're willing to share?
I've got classes like BlogSite, BlogChannel, BlogItem which understand the MetaWeblog API. They work with an updated version of Morphic Joules(http://swiki.squeakfoundation.org/squeakfoundation/uploads/morphicJoules/mor...) which I'm preparing to release RSN. I'm more than happy to share but won't get to it until the 6th or 7th.
Well I know that if I grab and click-hold the separator bar in the Browser window just above the senders button, then count to five, I count 80,000 signals to the semaphore, which is spinning that ioProcess loop really really fast... As you mention both processes do this together so nothing really gets stacked up. However this is needless work...
PS I've seen the glitch pause usually when the cursor has been changed in the browser windows; can't say how or what it is doing yet.
Given that the io process runs at a higher priority than the polling process (which is the case for all uses of Sensor I am aware of) this is impossible. The signal will be served right away by releasing the io process (note that since the polling process is running the io process _must_ be waiting on the semaphore). So there's definitely no grinding through pending signals going on here.
--
========================================================================
John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659
Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com
========================================================================
Well I know that if I grab and click-hold the separator bar in the Browser window just above the senders button, then count to five, I count 80,000 signals to the semaphore, which is spinning that ioProcess loop really really fast... As you mention both processes do this together so nothing really gets stacked up. However this is needless work...
Be that as it may, but unless you fix the clients not to use Sensor there is really nothing you can do. Sensor polls, and because of the VM's event support we have to make sure we get the appropriate state pulled in from the VM's event buffer. If we don't, then the polling loop will sit there forever.
Besides, signalling the event semaphore from the VM will only work if you have a multi-threaded VM implementation which can process incoming UI events in parallel to the VM's loop. If you don't, then you'll only get new events every 500ms or so (depending on what your VM interrupt check counter says) so not stimulating the input semaphore is not an option given all the non-multithreaded implementations.
I might add that I've been over all of this many, MANY times. While the current version might be inefficient it is the only way to preserve the expected notions from its clients - if you want to reduce the needless work done here then you need to fix the clients to work with our (not so) new assumptions (e.g., being event-driven no client should poll for state and expect it to change in its polling loops). Once you do this, you can rewrite everything in terms of the morphic hand rather than Sensor's polling API (which is the right thing to do).
Cheers, - Andreas
As you mention both processes do this together so nothing really gets stacked up. However this is needless work...
PS. Of course it is entirely irrelevant whether a polling loop like [Sensor anyButtonPressed] whileFalse. busy-loops within the io process or anywhere else. It's the nature of polling loops to waste cycles so it really doesn't matter where you waste them ;-) Note that by your rough estimate it would mean we can run approx. 10k io process loops per second (I would guess that the real number is even dramatically higher) so even trying to be a _little_ less aggressive about polling by just waiting a single millisecond would mean we'd reduce the load on the system by more than 90% ;-)))
Cheers, - Andreas
On Thu, Jul 03, 2003 at 04:55:29AM +0200, Andreas Raab wrote:
Be that as it may, but unless you fix the clients not to use Sensor there is really nothing you can do. Sensor polls, and because of the VM's event support we have to make sure we get the appropriate state pulled in from the VM's event buffer. If we don't, then the polling loop will sit there forever.
Besides, signalling the event semaphore from the VM will only work if you have a multi-threaded VM implementation which can process incoming UI events in parallel to the VM's loop. If you don't, then you'll only get new events every 500ms or so (depending on what your VM interrupt check counter says) so not stimulating the input semaphore is not an option given all the non-multithreaded implementations.
Why 500 ms? You can check from ioProcessEvents() and get them much more frequently. Does that function really only get called every 500 ms? That would be a problem on single-threaded VMs, e.g. for sound buffering.
By the way, these messages I'm posting are not theoretical. They are closely based on an actual implementation from a few years ago that predates the current one. It really does work, even if the user is switching between Morphic and MVC. I'm pretty sure I only checked for events in ioProcessEvents() and in the input primitives themselves.
I also did queue-flushing within the VM, so that I wouldn't have to have a limit on the number of events, but admittedly that's a matter of choosing your poison.
if you want to reduce the needless work done here then you need to fix the clients to work with our (not so) new assumptions (e.g., being event-driven no client should poll for state and expect it to change in its polling loops). Once you do this, you can rewrite everything in terms of the morphic hand rather than Sensor's polling API (which is the right thing to do).
Yeah, they should mostly be fixed anyway, at least unless you are working in MVC. :)
But also, polling Sensor should be a fine thing to do. One of the selling points of Squeak at Georgia Tech is that students are in control and that things they think will work, do work. In fact, it's not uncommon for them to write GUI code manually, where they watch the Sensor and update things on Display as appropriate. The goal is not *only* to have Morphic be happy, but -- at least for us -- to have a simple and comprehensible system.
Lex
Why 500 ms? You can check from ioProcessEvents() and get them much more frequently.
But ioProcessEvents gets called only every 500ms unless you query explicitly. Which is exactly what stimulating the input semaphore does.
Does that function really only get called every 500 ms? That would be a problem on single-threaded VM's, e.g. for sound buffering.
Huh? What has sound buffering to do with how often ioProcessEvents gets called?
By the way, these messages I'm posting are not theoretical. They are closely based on an actual implementation from a few years ago that predates the current one. It really does work, even if the user is switching between Morphic and MVC. I'm pretty sure I only checked for events in ioProcessEvents() and in the input primitives themselves.
We're mixing up a few independent discussions here. If you _do_ in fact query a primitive which calls ioProcessEvents() then you're right. However, notice that John's proposed solution did _not_ query ioProcessEvents but rather relied on it being called from "somewhere else" (namely from a different thread). In a single-threaded VM this solution will only update the input state every 500ms.
But also, polling Sensor should be a fine thing to do. One of the selling points of Squeak at Georgia Tech is that students are in control and that things they think will work, do work. In fact, it's not uncommon for them to write GUI code manually, where they watch the Sensor and update things on Display as appropriate. The goal is not *only* to have Morphic be happy, but -- at least for us -- to have a simple and comprehensible system.
Sure - the problem is not the polling of sensor but the difference in assumptions of how events are handled. If you mix up the notions of Morphic with that of MVC you're going to run into trouble. Note that we added buffered event support to address a real problem - namely that of missing events (such as clicks) due to polling delays. The problem with accessing Sensor directly is that it interferes with some deep notions of Morphic. If you write your own UI framework you are free to do whatever you want but if you want to play within the existing framework then you better agree on the rules.
Cheers, - Andreas
"Andreas Raab" andreas.raab@gmx.de wrote:
Why 500 ms? You can check from ioProcessEvents() and get them much more frequently.
But ioProcessEvents gets called only every 500ms unless you query explicitly. Which is exactly what stimulating the input semaphore does.
Whoa.... The Unix VM relies on being polled for almost every event it can generate. That includes sound buffering and socket events. This would explain at least some of the weird socket performance people on Squeak/Unix have reported: if you note that the idle loop will wake up *immediately* in response to an event, you get the situation where a busy Squeak will take 0.5 seconds to respond to a socket event, while an idle Squeak responds immediately. This should also cause problems for sound playing when Squeak is busy.
Yep! I just tried playing the Fun with Music song with two slightly different busy loops:
[ true ] whileTrue: [ ] "hiccups at a 0.5 second interval"
[ true ] whileTrue: [Sensor mousePoint] "plays beautifully"
How about we use a 1 ms interval instead of 500 ms? Platforms that don't use the function won't notice the difference, because they'll have the functions stubbed out. However, platforms that do use the function will have a remarkable improvement in the latency between events and when Squeak notices them. In short, if you need this function, you really need it! Is there much reasoning behind 500 ms, or was that just what got stuck in there at the time?
The problem with accessing Sensor directly is that it interferes with some deep notions of Morphic. If you write your own UI framework you are free to do whatever you want but if you want to play within the existing framework then you better agree on the rules.
In general, yes, but just *looking* at the mouse state should not throw Morphic into a tizzy, should it? Is this quantum Squeak? :)
I had the impression that John was seeing actual laggy *behavior*, not just a lot of signals to the semaphore. If there was no observable behavior (short of someone prodding with tracing tools :)) then never mind me.
Similarly, accessing Display in general will cause problems if Morphic is running, but also the problems you see are predictable based on the most intuitive mental model, ie that Morphic refreshes its stuff onto Display whenever it feels like it. Reading from Display won't cause Morphic to even notice.
Lex
Hi Lex,
Whoa.... The Unix VM relies on being polled for almost every event it can generate.
Hm ... really interesting. I was actually considering switching to the aioPoll mechanism (mostly to avoid dealing with "wild threads" in a few not-so-time-critical areas where it is known that things will take a while and sub-millisecond responses don't matter as much), but this sheds a new light on the issue...
How about we use a 1 ms interval instead of 500 ms? Platforms that don't use the function won't notice the difference, because they'll have the functions stubbed out. However, platforms that do use the function will have a remarkable improvement in the latency between events and when Squeak notices them. In short, if you need this function, you really need it! Is there much reasoning behind 500 ms, or was that just what got stuck in there at the time?
Well, actually the Unix VM abuses ioProcessEvents here, as this is supposed to handle user input. The only reason why it gets called every 500ms at all is to allow interrupt keys to get through even when in a busy loop (in a non-threaded VM you would loop forever otherwise).
I guess what we _should_ be doing here is something in #checkForInterrupts which covers the "poll often for io activity" case (as fetching user input is something that is typically expensive and I probably wouldn't want to mix up the "quick" with the "slow" check). So, for example, we could change it to something like:
Interpreter>>checkForInterrupts
    "... yaddaya ..."
    now = lastTick ifFalse: [self ioPollEvents].
which will check as often as your msec resolution supports (note that this requires the interruptChecksEveryNms to be set appropriately too). I'd be happy with the above.
Oh, by the way, one trick I'm using lately to get msec resolution for timers is to provide a hardware interrupt (similar to itimer) which resets the interruptCheckCounter, thereby forcing an interrupt check every msec. This has not shown any negative side effects so far (running 1000 checks per second is really cheap these days ;-) and would give you a guaranteed msec response check for the above.
The problem with accessing Sensor directly is that it interferes with some deep notions of Morphic. If you write your own UI framework you are free to do whatever you want but if you want to play within the existing framework then you better agree on the rules.
In general, yes, but just *looking* at the mouse state should not throw Morphic into a tizzy, should it? Is this quantum Squeak? :)
In a way, it really is! Much of this is in the (mostly outdated) interwoven expectations by various clients. There are some contradicting assumptions in the interplay between Sensor's original behavior and what is provided by EventSensor. For example, in order for EventSensor to provide any "current" state (e.g., Sensor's state is polling and EventSensor simulates the behavior) it needs to process any pending events from the VMs event queue. And it just so happens that it also needs to dump the event afterwards as otherwise it might report this event to Morphic. Consider what this means for something like "Rectangle fromUser" - here, you'd get the mouseDown/mouseUp events _after_ Rectangle fromUser completes.
Because of the above (polling Sensor and assuming that anything seen by it cannot be reported afterwards) there are some quantum effects here ;-)
I had the impression that John was seeing actual laggy *behavior*, not just a lot of signals to the semaphore. If there was no observable behavior (short of someone prodding with tracing tools :)) then never mind me.
I don't exactly know what John's observation was (or more exactly: I don't know how to duplicate it) so I can't really comment on it.
Similarly, accessing Display in general will cause problems if Morphic is running, but also the problems you see are predictable based on the most intuitive mental model, ie that Morphic refreshes its stuff onto Display whenever it feels like it. Reading from Display won't cause Morphic to even notice.
Well, no. First of all, you don't expect Display to _change_ while you read it - but you do expect Sensor to change (BIG difference - if we can agree that a busy-loop will always answer the same if executed on Sensor we'd have none of the trouble).
And secondly, you may _not_ get any predictable results if you bypass all of Morphic's mechanisms - as an example, the Balloon engine may (and for the most part will) batch its operations, and only after it executes #flush are the effects known to be visible.
Cheers, - Andreas
I'll just repeat my suggestion that we try to split out some of this stuff so that platform-specific VM code can handle it when needed. A few macros etc. are all it should take. Then we can get rid of some ugliness - just look at checkForInterrupts().
"Andreas Raab" andreas.raab@gmx.de wrote:
Oh, by the way, one trick I'm using lately to get msecs resolution for timers is to provide a hardware interrupt (similar to itimer) which resets the interruptCheckCounter therefore forcing an interrupt check every msec.
Sounds nice. I should make an attempt at using that myself.
In a way, it really is! Much of this is in the (mostly outdated) interwoven expectations by various clients. There are some contradicting assumptions in the interplay between Sensor's original behavior and what is provided by EventSensor.
It really is way past time that the Sensor/EventSensor stuff was rewritten to clean it up properly. We originally had the EventSensor to allow the then-new event-driven VMs to work and left the Sensor for older VMs and MVC projects. These days EventSensor is used in MVC projects as well and it handles (however badly) the non-event VMs. Let's simplificationize the whole thing.
tim -- Tim Rowledge, tim@sumeru.stanford.edu, http://sumeru.stanford.edu/tim Strange OpCodes: PNG: Pass Noxious Gas
Besides, signalling the event semaphore from the VM will only work if you have a multi-threaded VM implementation which can process incoming UI events in parallel to the VM's loop. If you don't, then you'll only get new events every 500ms or so (depending on what your VM interrupt check counter says) so not stimulating the input semaphore is not an option given all the non-multithreaded implementations.
Well yes, the Mac OS X VM is a multi-threaded VM that has a separate thread for the UI events. That's part of the Carbon/Cocoa event model. In fact the UI thread uses pthread locking to deposit events onto the Squeak VM thread.
What I've done as a first pass is create a message to replace the inputSemaphore signal; that way one can check out doing a Delay wait of 1 millisecond to see how that affects CPU load.
As a subclass I've created an EventSensorVMDriven class which does nothing for the inputSemaphoreSignal method, and relies on the multi-threaded event-driven VM to signal the semaphore when events actually are constructed and placed on the VM event queue. As part of the EventSensor install command I install the correct class based on the OS type and version information.
This seems to work fine. Others are welcome to try it and give feedback on how it feels or if there are issues. Certainly I would like to hear about issues on multi-CPU Macintoshes.
Please ensure the global Sensor is the right class after you've filed in and installed the changes.
Technically I can move towards this model on OS 9 too because it's pseudo-threaded underneath, but the signal inputSemaphore has been commented out for many years...
On Wed, Jul 02, 2003 at 11:27:22PM -0700, John M McIntosh wrote:
Besides, signalling the event semaphore from the VM will only work if you have a multi-threaded VM implementation which can process incoming UI events in parallel to the VM's loop. If you don't, then you'll only get new events every 500ms or so (depending on what your VM interrupt check counter says) so not stimulating the input semaphore is not an option given all the non-multithreaded implementations.
Well yes, the Mac OS X VM is a multi-threaded VM that has a separate thread for the UI events. That's part of the Carbon/Cocoa event model. In fact the UI thread uses pthread locking to deposit events onto the Squeak VM thread.
As an aside, the Unix VM with an X display could probably be made to do this without resorting to pthreads. All external events are going to originate from one of:
1) external files
2) sockets
3) events originating from the X server (which arrive on a socket channel)
4) OS signals
The first three can all be handled by the aio mechanism in Ian's VM, and the fourth can be handled in the obvious way. All four can therefore trigger a Squeak Semaphore, and no VM-level threads would be required.
Dave
On Friday 04 July 2003 07:53 am, David T. Lewis wrote:
As an aside, the Unix VM with an X display could probably be made to do this without resorting to pthreads. All external events are going to originate from one of:
1) external files
2) sockets
3) events originating from the X server (which arrive on a socket channel)
4) OS signals
The first three can all be handled by the aio mechanism in Ian's VM, and the fourth can be handled in the obvious way. All four can therefore trigger a Squeak Semaphore, and no VM-level threads would be required.
Yes, but for #3 if you wanted to trigger a Squeak Semaphore directly from the X server, you'd have to move X event handling up into Squeak, right? Right now, the aio handler for the X server socket doesn't directly trigger a SqueakSemaphore (at least not until there's some mouse/keyboard/drag event ready).
On Fri, Jul 04, 2003 at 08:16:50AM -0700, Ned Konz wrote:
Yes, but for #3 if you wanted to trigger a Squeak Semaphore directly from the X server, you'd have to move X event handling up into Squeak, right? Right now, the aio handler for the X server socket doesn't directly trigger a SqueakSemaphore (at least not until there's some mouse/keyboard/drag event ready).
Yes. Presumably you would want some generalized representation of external events and queueing of events to the image, with the VM waking up the image whenever something interesting happens. Not everything that came in through the X channel would be interesting to the Squeak image, so the X support in the VM could exercise some discretion as to what events from the X channel need to be forwarded to the image.
In principle you would handle all of the X events in the image, with some plugin to allow the image to control the X display. But I suspect that would be rather over the top since this is all working fine in the existing VM already.
I'm getting way out of my depth at this point. I just wanted to mention that the aio mechanism can be made quite general, and that on some platforms VM threads would not be necessary if you were going to support aio anyway. That would allow the timer polling to be reduced or eliminated with no need for a threaded VM. A threaded VM might or might not be desirable for other reasons, but handling of external events is not a reason to require it, at least not if the platform is already capable of doing aio handling.
Dave
p.s. I finally caught up with the rest of this thread, and I would not be surprised if the work that Lex mentioned that he did a few years ago did exactly this kind of thing.
p.p.s. The X protocol itself provides a working example of a cross-platform representation of queued events, so apparently this is not an impossible thing to do.
On Fri, Jul 04, 2003 at 10:53:10AM -0400, David T. Lewis wrote:
As an aside, the Unix VM with an X display could probably be made to do this without resorting to pthreads. All external events are going to originate from one of:
- external files
- sockets
- events originating from the X server (which arrive on a socket channel)
- OS signals
The first three can all be handled by the aio mechanism in Ian's VM, and the forth can be handled in the obvious way. All four can therefore trigger a Squeak Semaphore, and no VM level threads would be required.
I just posted an AioPlugin goodie with examples to demonstrate the event notification mechanism for async files, sockets, Unix pipes, and the standard input stream.
Dave
While it's neat to think about where these calls are coming from, IMHO it doesn't seem valuable to try and remove them. If you want to speed things up, try reducing the number of redundant things that get redrawn for common operations in Morphic. :)
Lex
Well, it's idle curiosity, and fiddling with Apple's new set of hardware analysis tools. Those pointed to the fact that we were doing a millisecond clock check every 2ish milliseconds from the Smalltalk code when the system is IDLE. If it's idle, why is it so busy....
PS: part of this was discovering how to tune the Mpeg plugin to take a test case from 49.38 to 69.67 frames per second. That is quite a major change (gcc 3.3 & some code changes) that was not at all obvious from the earlier CodeWarrior-based tuning tools. So it's not all wasted effort, and I did learn that for the Mpeg plugin the number of syscalls required to do disk io is low, a question from the past which requires no further thought.
Funny you should mention Morphic drawing
As part of my work I did insert a Transcript print into the ioprocess loop. That results in a slowdown of the entire drawing cycle and you get to observe some *very interesting* drawing behaviour.
I've also noted that in BitBltSimulation>>copyLoop you've got:

    2 to: nWords-1 do: [ :word |
        "Note loop starts with prevWord loaded (due to preload)"
        self dstLongAt: destIndex put: prevWord.
        destIndex _ destIndex + hInc.
        prevWord _ self srcLongAt: sourceIndex.
        sourceIndex _ sourceIndex + hInc]]]

This resolves (with my previous copyloop code change) to:

    for (word = 2; word <= nWordsMinusOne; word += 1) {
        longAtput(destIndexLocal, prevWord);
        destIndexLocal += hInc;
        prevWord = longAt(sourceIndexLocal);
        sourceIndexLocal += hInc;
    }
mmm which really implies you want to move a bunch of words from sourceIndexLocal to destIndexLocal assuming hInc is 4 with special care for the first and last words based on masking bits. So with the above C, does your compiler really optimize the assembler instructions correctly to make it happy?
-- ======================================================================== === John M. McIntosh johnmci@smalltalkconsulting.com 1-800-477-2659 Corporate Smalltalk Consulting Ltd. http://www.smalltalkconsulting.com ======================================================================== ===
Hi john
Have you ever noticed that sometimes complete windows get redrawn extremely slowly on your mac? Nathanael and I noticed that this was related to the use of different processes. I tried several times to reproduce it without success. I noticed that this behavior happens when a deprecated method gets called: once I click proceed, the redraw gets really slow.
Stef
Now that Apple is going to ship general purpose 64-bit machines I'll point out that I'm quite willing (with others) to work on a 64-bit version of Squeak on a shiny new G5. However the only issue here is that someone will need to pay for all of this, so if someone has some funding, I could use a replacement desktop machine and work for the fall/winter of 2003/2004.
:-}
I think that John definitely deserves a new G5! But until then, at least folks on old-school (ha!) 32-bit machines can still compile optimized code for the G5 using the dev tools update posted at the ADC.
Regards, Aaron (who can't wait to see what John pulls out of his magic hat for the G5 squeak!)
-- "a system based on exchanging products inevitably channels wealth to a few, and no governmental change will ever be able to correct that." :: daniel quinn
John M McIntosh wrote:
Now that Apple is going to ship general purpose 64-bit machines I'll point out that I'm quite willing (with others) to work on a 64-bit version of Squeak on a shiny new G5. However the only issue here is that someone will need to pay for all of this, so if someone has some funding, I could use a replacement desktop machine and work for the fall/winter of 2003/2004.
:-}
While you're at it, make that two machines. ;)
Seriously though, this would be a great thing to have... Squeak might even be able to say that it's the first Smalltalk VM (and probably the first open source OO VM) that is 64-bit. If you do it, make sure that an image saved from a 64-bit VM can be loaded into a 32-bit VM (providing it fits within 32 bits of address space) and vice versa.
That raises another interesting question... presumably, the only real need for 64 bits (other than perhaps CPU performance) is when your image grows larger than 4 GB. In an image of that size or larger, I imagine a full GC is going to take quite a while to run. And at times I imagine you'd want to collect more than the incremental GC does, but without waiting on a lengthy full GC. Would it be useful in a 64-bit object memory to introduce more generations?
- Stephen
On Thu, 3 Jul 2003, Stephen Pair wrote:
Seriously though, this would be a great thing to have... Squeak might even be able to say that it's the first Smalltalk VM (and probably the first open-source OO VM) that is 64-bit. If you do it, make sure that an image saved from a 64-bit VM can be loaded into a 32-bit VM (provided it fits within 32 bits of address space) and vice versa.
What about the various Smalltalks (VAST, VW, GemStone/S) for IBM AIX POWER machines? I suppose they could be running in some mix of what someone could call 32- and 64-bit... For that matter, what about Solaris and IRIX, both of which have some Smalltalk or another...
Regards, Aaron
-- "if i don't stay true to live and hate, how do i differentiate between chasing cream and chasing dreams" :: atmosphere
Aaron J Reichow wrote:
On Thu, 3 Jul 2003, Stephen Pair wrote:
Seriously though, this would be a great thing to have... Squeak might even be able to say that it's the first Smalltalk VM (and probably the first open-source OO VM) that is 64-bit. If you do it, make sure that an image saved from a 64-bit VM can be loaded into a 32-bit VM (provided it fits within 32 bits of address space) and vice versa.
What about the various Smalltalks (VAST, VW, GemStone/S) for IBM AIX POWER machines? I suppose they could be running in some mix of what someone could call 32- and 64-bit... For that matter, what about Solaris and IRIX, both of which have some Smalltalk or another...
Regards, Aaron
AFAIK none of those have 64-bit object pointers... which is what I assume John meant when he says 64-bit VM. I believe that is a problem for some GemStone customers, who've had to employ various techniques to limit the consumption of OOPs (to prevent them from running out). Also, I've heard of a few VW customers that keep very large numbers of objects in memory and approach that 32-bit limitation.
- Stephen
On Thu, 3 Jul 2003, Stephen Pair wrote:
AFAIK none of those have 64-bit object pointers... which is what I assume John meant when he says 64-bit VM. I believe that is a problem for some GemStone customers, who've had to employ various techniques to limit the consumption of OOPs (to prevent them from running out). Also, I've heard of a few VW customers that keep very large numbers of objects in memory and approach that 32-bit limitation.
Ah yes, I think you may be right. When I interned at Progressive Insurance (who use VAST and GS/S for that app), they were in the middle of a project fixing up a design error made early on: using symbols instead of strings for things like #Y, #N and other constants. With symbols, they were running out of OOPs and generally slowing things down...
Regards, Aaron
-- "the profit system follows the path of least resistance and following the path of least resistance is what makes a river crooked." :: u. utah phillips
On Thu, Jul 03, 2003 at 03:20:14PM -0400, Stephen Pair wrote:
That raises another interesting question...presumably, the only real need for 64 bits (other than perhaps CPU performance) is when your image grows larger than 4 GB...
Another thing that could be worth some experiments would be multiple tag bits: we now use one bit to distinguish between pointers to objects and ints. With 64 bits we could have many more tag bits and then encode many objects the way we now encode ints: 64 bits should be enough to play with immediate Points, floats, chars... could be interesting.
Marcus
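Marcus's idea can be sketched concretely. Below is a minimal, hypothetical tagging scheme in C; the tag values and the 3-bit field are illustrative placeholders, not Squeak's actual (or any planned) 64-bit object format:

```c
#include <stdint.h>

/* Hypothetical 3-bit tag field in the low bits of a 64-bit word.
   These tag assignments are illustrative only, not Squeak's format. */
enum { TAG_OOP = 0, TAG_SMALLINT = 1, TAG_CHAR = 2, TAG_FLOAT = 3 };

#define TAG_BITS 3
#define TAG_MASK ((1u << TAG_BITS) - 1)

/* Immediate SmallInteger: payload lives in the upper 61 bits. */
static uint64_t tagInt(int64_t v)    { return ((uint64_t)v << TAG_BITS) | TAG_SMALLINT; }
static int64_t  untagInt(uint64_t w) { return (int64_t)w >> TAG_BITS; } /* arithmetic shift */

/* Immediate Character: code point in the payload. */
static uint64_t tagChar(uint32_t c)   { return ((uint64_t)c << TAG_BITS) | TAG_CHAR; }
static uint32_t untagChar(uint64_t w) { return (uint32_t)(w >> TAG_BITS); }

static int tagOf(uint64_t w) { return (int)(w & TAG_MASK); }
```

With 61 payload bits there would even be room for an immediate float (dropping a few mantissa bits), which is roughly what Marcus is hinting at.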
John
We should find a way to get money into the community. The CD Marcus set up is one way. For my part, I'm ready to pay $100 (I know this is peanuts) to get a G5 VM for Squeak as soon as I have a bill. So have you thought about opening an account where we could pay for the Mac VM? Or any other clever solution?
Stef
On Thursday, July 3, 2003, at 09:03 PM, John M McIntosh wrote:
Now that Apple is going to ship general-purpose 64-bit machines, I'll point out that I'm quite willing (with others) to work on a 64-bit version of Squeak on a shiny new G5. However, the only issue here is that someone will need to pay for all of this, so if someone has some funding, I could use a replacement desktop machine and would work on it for the fall/winter of 2003/2004.
:-}
Based on feedback I've posted new mpeg plugins for os-9 and os-x.
These will appear in the usual places in a day or two.
For OS 9 you can expect an 8% or better increase in frames per second. OS X users have reported much better results, depending, I think, on the CPU type.
I imagine full GC is going to take quite a while to run. And, at times I imagine you'd want to collect more than the incremental GC, but without waiting on a lengthy full GC. Would it be useful in a 64 bit object memory to introduce more generations?
- Stephen
Nope, you forget as I did (and Tim then reminded me) that you just fall back to a reference counting GC.
a) you've got quite a few more bits around for making the header bigger (32 bits would be enough)
b) 64-bit machines usually have quite a few extra integer execution engines, and you can hide the reference-counting math in the regular execution of the reference update, so there is no cost.
c) Mmm, I wonder how long you could run if you skipped GC compaction entirely? You'd just keep allocating new pages from OS virtual memory.
PS This does raise the question: would reference counting work better on a high-end 32-bit CPU today? I wonder if anyone has gone back to try this?
John M McIntosh writes:
Nope, you forget as I did (and Tim then reminded me) that you just fall back to a reference counting GC.
a) you've got quite a few more bits around for making the header bigger (32bits would be enough)
A performance problem with reference counting is that you need to both read and write the reference count for each change. With a generational collector, a single write can update the write barrier.
Reads and writes are going to bottleneck on memory bandwidth. The performance penalty of memory access is growing as CPUs increase in speed faster than memory.
It's worse: when creating a new object and filling its instance variables, the write barrier will have good locality, while the reference counter will modify several objects, most likely in separate cache lines. The Athlon will combine several small writes into a single large write, so a generational system would optimally end up with a single write while a reference counter would need a read and a write for each variable stored.
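The contrast Bryce draws can be made concrete. The toy C sketch below (a hypothetical object layout and remembered-set barrier, not Squeak's actual implementation) shows why a reference-counting store touches more cache lines than a generational store check:

```c
#include <stddef.h>

/* Toy object model -- purely illustrative, not Squeak's header format. */
typedef struct Obj { int refcount; int isOld; struct Obj *field; } Obj;

#define REMEMBERED_MAX 64
static Obj   *rememberedSet[REMEMBERED_MAX];
static size_t rememberedCount = 0;

/* Reference-counting store: read-modify-writes the counts of both the
   new referent and the old one -- up to two extra cache lines touched. */
static void rcStore(Obj *holder, Obj *newVal) {
    if (newVal) newVal->refcount++;
    if (holder->field) holder->field->refcount--;
    holder->field = newVal;
}

/* Generational store: at most one extra write (into the remembered set),
   and only in the uncommon old->young case. */
static void genStore(Obj *holder, Obj *newVal) {
    if (holder->isOld && newVal && !newVal->isOld
        && rememberedCount < REMEMBERED_MAX)
        rememberedSet[rememberedCount++] = holder;
    holder->field = newVal;
}

/* Small self-check: one RC store bumps a count, one old->young generational
   store records exactly one remembered-set entry. */
static int storeDemo(void) {
    Obj young = {0, 0, NULL}, old = {1, 1, NULL};
    rcStore(&old, &young);
    if (young.refcount != 1 || old.field != &young) return 0;
    genStore(&old, &young);
    return rememberedCount == 1 && rememberedSet[0] == &old;
}
```

The point of the sketch: `rcStore` dirties up to three objects per pointer store, while `genStore` usually dirties only the holder.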
b) 64bit machines usually have quite a few extra integer execution engines and you can hide the reference counting math in the regular execution of the reference update, so there is no cost.
It's not the math that will hurt.
c) Mmm I wonder if you forget to do GC compaction how long you could run? You just keep allocating new pages from OS virtual memory.
Does the Mac provide easy useful access to the memory management unit? Using the MMU to handle the write barrier could allow an efficient background collector of oldspace. I think this is described in a paper by Appel from the mid eighties.
Hmm but then doesn't Squeak record writes rather than use a card table? That would make background collection with a "patch up" after each incremental collect possible.
Bryce
On Thu, 3 Jul 2003, John M McIntosh wrote:
Nope, you forget as I did (and Tim then reminded me) that you just fall back to a reference counting GC.
Doesn't that mean you have to start worrying about cycles?
Yes, but you need a mark/sweep/compacting GC to compact things at some point, so you deal with cyclical issues then.
I've put a Macintosh 3.5.2b1 VM into the update stream; it will appear on the ftp servers at some future time.
The major change is the use of GCC 3.3, which improves send performance by 5 percent or more. {Other VM folks who have access to GCC 3.3 on other platforms should take notice.}
It also removed about 40K of instructions, so I suspect other improvements exist. I've not run the macro benchmarks but welcome others to do so.
Also I've included my copyLoop changeset that improves drawing performance by about 5%.
I think that John definitely deserves a new G5! But until then, at least folks on old-school (ha!) 32-bit machines can still compile optimized code for the G5 using the dev tools update posted at the ADC.
Regards, Aaron (who can't wait to see what John pulls out of his magic hat for the G5 squeak!)
Actually, one of the changes in 3.5.2b1 was to compile and align loops on 16-byte boundaries per suggestions from Apple's performance tools. Interestingly enough, that produces much more consistent tinyBenchmarks numbers. Right now I'm testing align=32 for loops/functions/jumps/labels, per Apple's recommendations for 32-bit G5 applications, to see what happens.
So I wasn't quite happy with doing, say, a 5 ms wait. Certainly I can't tell, with a bit of testing, any difference between 2 and 5 ms.
Feedback is welcome.
Ah well, I was out boating this afternoon and realized that the 2 ms wait I'd just put into EventSensor>>nextEventFromQueue isn't a good thing for a non-event-driven VM. Thus I've changed the logic to use nextEventFromQueueNoWait.
Attached is a modified changeset.
This FIX from July of 2003 needs reviewers familiar with the Event system. Note that in BFAV2 the reply from John contains an updated version of the original submission. If the right person spends 5 minutes on this I'm confident that this one can be taken care of. Either that or one of the harvesters should note that this is a submission from John McIntosh, give it a quick glance, go ahead and harvest it, and be ready to backtrack if problems turn up.
See BFAV2 ID 11043.
Ken
Hi john
Have you ever noticed that sometimes complete windows get redrawn extremely slowly on your Mac? Nathanael and I noticed that this was related to the use of different processes. I tried several times to reproduce it without success. I noticed that this behavior happens when a deprecated method gets called: once I click proceed, the redraw gets really slow.
Stef
Nope, can't say I've seen that, but let me propose a thought that might help find it.
What i'd suggest is in EventSensor>>processEvent:
either on the interrupt key, or add a new signal say cmd-?
Have it wake up a high-priority task and dump all the process stacks and other information somewhere, say appended to a text file. That way, when it happens, maybe, just maybe, you can hit cmd-? a couple of times, capture some information, and then look for something that looks odd.
PS Do you have an example, or what you think is an example, that I can try?
On Friday 04 July 2003 01:09 am, John M McIntosh wrote:
What i'd suggest is in EventSensor>>processEvent:
either on the interrupt key, or add a new signal say cmd-?
Have it wake up a high-priority task and dump all the process stacks and other information somewhere, say appended to a text file. That way, when it happens, maybe, just maybe, you can hit cmd-? a couple of times, capture some information, and then look for something that looks odd.
I'd love this. Also, I'd like to have a "jump into opposite mode and debug the current process" key: if you're in Morphic, you get to debug the (suspended) Morphic process in MVC, and vice versa...
Nope, can't say I've seen that, but let me propose a thought that might help find it.
What i'd suggest is in EventSensor>>processEvent:
either on the interrupt key, or add a new signal say cmd-?
No idea. Nathanael spent a lot of time trying to design a minimal program reproducing it.
Have it wake up a high-priority task and dump all the process stacks and other information somewhere, say appended to a text file. That way, when it happens, maybe, just maybe, you can hit cmd-? a couple of times, capture some information, and then look for something that looks odd.
PS Do you have an example, or what you think is an example, that I can try?
Take a 3.6 alpha, open some windows and a browser, then invoke a deprecated method such as Smalltalk beep. When you proceed from the notifier, you have a good chance of seeing all the panes getting refreshed extremely slowly, one after another, even those that should not get redrawn.
Stef
PS Do you have an example, or what you think is an example, that I can try?
Take a 3.6 alpha, open some windows and a browser, then invoke a deprecated method such as Smalltalk beep. When you proceed from the notifier, you have a good chance of seeing all the panes getting refreshed extremely slowly, one after another, even those that should not get redrawn.
You can almost certainly make it happen by adding "Processor yield" as the first line in Debugger>>resumeProcess: - it all looks as if there's confusion about which of the processes (the debugged one or the active one) is the "UI process" here. I would suspect that the Mac VM (being multi-threaded) issues a signal on the input semaphore which invokes a process switch very similar to adding Processor yield in #resumeProcess:.
Cheers, - Andreas
When you run the FreeCell demo it seems on the powerpc that BitBltSimulation>>alphaSourceBlendBits16 chews a lot of cycles, say 8.7% of the total time.
Well, I noticed that BitBltSimulation>>dither32To16:threshold: does something interesting in the way it grabs the r, g, b bytes and mangles them into a 16-bit word. But really, at each step we have 8 bits of data plus 4 bits of threshold information to produce 5 bits of output.
{Yes, well it's late maybe my math is wrong}
Gee, can you say a lookup table (4096 elements of 5 bits each; however, 8 will do)? Yes, you can. {Actually I'm not sure using a byte is wise; perhaps a word is faster, silly RISC machines.}
New:

dither32To16: srcWord threshold: ditherValue
	"Dither the given 32-bit word to 16 bits. Ignore alpha."
	| addThreshold |
	self inline: true. "You bet"
	addThreshold _ ditherValue bitShift: 8.
	^ ((dither8Lookup at: (addThreshold + ((srcWord bitShift: -16) bitAnd: 255))) bitShift: 10)
		+ ((dither8Lookup at: (addThreshold + ((srcWord bitShift: -8) bitAnd: 255))) bitShift: 5)
		+ (dither8Lookup at: (addThreshold + (srcWord bitAnd: 255)))
Populating the dither8Lookup (name to change) table is left as an exercise for the reader. This, needless to say, makes quite a difference: we shave about 20% off our 16-bit alpha-blended redraw times, which takes CPU cycle usage down to 4.8%.
I'll need to clean this up a bit, and of course if anyone wants to test, please let me know.
Original:

dither32To16: srcWord threshold: ditherValue
	"Dither the given 32-bit word to 16 bits. Ignore alpha."
	| pv threshold value out |
	self inline: true. "You bet"
	pv _ srcWord bitAnd: 255.
	threshold _ ditherThresholds16 at: (pv bitAnd: 7).
	value _ ditherValues16 at: (pv bitShift: -3).
	ditherValue < threshold ifTrue: [out _ value + 1] ifFalse: [out _ value].
	pv _ (srcWord bitShift: -8) bitAnd: 255.
	threshold _ ditherThresholds16 at: (pv bitAnd: 7).
	value _ ditherValues16 at: (pv bitShift: -3).
	ditherValue < threshold ifTrue: [out _ out bitOr: (value + 1 bitShift: 5)] ifFalse: [out _ out bitOr: (value bitShift: 5)].
	pv _ (srcWord bitShift: -16) bitAnd: 255.
	threshold _ ditherThresholds16 at: (pv bitAnd: 7).
	value _ ditherValues16 at: (pv bitShift: -3).
	ditherValue < threshold ifTrue: [out _ out bitOr: (value + 1 bitShift: 10)] ifFalse: [out _ out bitOr: (value bitShift: 10)].
	^ out
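For the curious, here is one hedged sketch in C of how such a table could be populated and used. The threshold values are illustrative placeholders only; the real ditherThresholds16/ditherValues16 contents are not reproduced here, and the value < 31 guard is just a safety clamp for the placeholder data:

```c
#include <stdint.h>

/* 16 dither levels x 256 channel-byte values -> 5-bit output channel.
   Index = (ditherValue << 8) + channelByte, mirroring John's new method. */
static uint8_t dither8Lookup[16 * 256];

/* Placeholder ordered-dither thresholds; NOT Squeak's actual table. */
static const uint8_t thresholds[8] = {0, 8, 4, 12, 2, 10, 6, 14};

static void buildDitherLookup(void) {
    for (int d = 0; d < 16; d++)
        for (int pv = 0; pv < 256; pv++) {
            int value = pv >> 3;                 /* top 5 bits of channel */
            if (d < thresholds[pv & 7] && value < 31)
                value++;                         /* dithered round-up */
            dither8Lookup[(d << 8) + pv] = (uint8_t)value;
        }
}

/* The per-pixel work then collapses to three indexed loads and shifts.
   (The table is built lazily on first use, purely for this demo.) */
static uint16_t dither32To16(uint32_t srcWord, int ditherValue) {
    static int built = 0;
    int t = ditherValue << 8;
    if (!built) { buildDitherLookup(); built = 1; }
    return (uint16_t)(
        ((unsigned)dither8Lookup[t + ((srcWord >> 16) & 255)] << 10) |
        ((unsigned)dither8Lookup[t + ((srcWord >>  8) & 255)] <<  5) |
                   dither8Lookup[t + ( srcWord        & 255)]);
}
```

The win is the same one John measures: three conditional-branch-laden channel computations become three loads from a table that fits comfortably in cache.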
"Lex Spoon" <lex@c...> wrote I had the impression that John was seeing actual laggy *behavior*, not just a lot of signals to the semaphore. If there was no observable behavior (short of someone prodding with tracing tools :)) then never mind me.
Yes, what I've seen on the Mac: as you move upwards, the cursor bogs down in the stretch area; you continue and lose the cursor, which ends up bouncing off the top of the Mac menu bar as cursor movement becomes disconnected from hand feedback.
A Windows user mentioned the same to me, except for losing control completely, pegging the CPU, and having to force-quit Squeak to regain control.
John M McIntosh johnmci@mac.com wrote:
"Lex Spoon" <lex@c...> wrote I had the impression that John was seeing actual laggy *behavior*, not just a lot of signals to the semaphore. If there was no observable behavior (short of someone prodding with tracing tools :)) then never mind me.
Yes, what I've seen on the Mac: as you move upwards, the cursor bogs down in the stretch area; you continue and lose the cursor, which ends up bouncing off the top of the Mac menu bar as cursor movement becomes disconnected from hand feedback.
Same thing on Acorn except happens in all directions.
tim -- Tim Rowledge, tim@sumeru.stanford.edu, http://sumeru.stanford.edu/tim Strange OpCodes: DNPG: Do Not Pass Go
Tim Rowledge tim@sumeru.stanford.edu wrote:
John M McIntosh johnmci@mac.com wrote:
"Lex Spoon" <lex@c...> wrote I had the impression that John was seeing actual laggy *behavior*, not just a lot of signals to the semaphore. If there was no observable behavior (short of someone prodding with tracing tools :)) then never mind me.
Yes, what I've seen on the Mac: as you move upwards, the cursor bogs down in the stretch area; you continue and lose the cursor, which ends up bouncing off the top of the Mac menu bar as cursor movement becomes disconnected from hand feedback.
Same thing on Acorn except happens in all directions.
I don't see it on Unix. 3.6 image and a pretty current VM. Lex
From: "Andreas Raab" < andreas.raab@g... > Hi Guys,
I was always suspicious about the way CCodeGenerator handled #interpret with respect to temps (e.g., inlining all temps into interpret and randomly renaming them t1 ... tN) as it completely spoils life-time analysis for the C compiler (which has to assume that temps may be read in other code branches and may even "optimize" them into wasting unneeded registers across code branches).
First I was using Change Set: CGeneratorEnhancements-ajh Date: 12 February 2002 Author: Anthony Hannan (ajh18@cornell.edu)
which localized the variables in interpret(), but your change set is a cleaner solution.
I downloaded and set up a new image with SM & loaded the latest VMMaker (or so I think/thought/believe).
Ran into some issues with the version of VMMaker you used and the current one. Tim and you can sort out what's happening.
TMethod lost an instance variable, globalStructureBuildMethodHasFoo, and this overwrote a change in TMethod>>setSelector:args:locals:block:primitive:
These two I'm unsure about who's at fault. a) Interpreter lost the class variable BlockMethodIndex b) and the method isUnwindMarked: is missing {Isn't that the block closure stuff?}
Also the two variables in interpret() localReturnContext & localReturnValue end up with no declaration.
Well, because I was already using Hannan's changeset in earlier work (since 3.2.7b1), the difference is too small/difficult to measure. For the GCC flavor I don't think there was any difference in code size (40 bytes smaller for the entire VM, but I was missing the isUnwindMarked: method, so I think that accounts for the 40 bytes).
For CodeWarrior on OS 9 there was a 46-byte difference in the interpret() function, but any improvement is lost in measurement noise. In the past I used Hannan's changeset because it was obvious that CodeWarrior just gave up doing any useful local-variable analysis, stuck the first couple of vars into registers, and was stupid... This also made great improvements in how the 68K version worked with GCC on OpenBSD 3.x.
From a note of mine to the list on April 9th, 2002 talking about this:
on a 68k BSD box with GCC, the new numbers are:
{Hannan changeset} 1,614,205 bytecodes/sec and 57,652 sends/sec
versus my previous one using the jumptable modification: 1,550,387 bytecodes/sec and 55,080 sends/sec
versus what I started with: 1,439,884 bytecodes/sec and 51,098 sends/sec
So yes the change is good.
---------- PS Another topic. In my measurements of the macrobenchmark I see:
55.9% interpret()
4.5% sweepPhase
5.0% markPhase
3.0% updatePointers (spelling?)
0.9% incCompMove
Thus 10% lurks in the mark/sweep phase of the GC.
Fiddling with ObjectMemory>>startField can be measured in the tinyBenchmarks. I'm considering checking for type 0, else type = 2, otherwise it's a SmallInteger. That becomes a load with set condition, a branch on condition, a check against 2, and a branch on condition. This improves the macroBenchmarks by 2% but degrades the tinyBenchmarks because of the integers it creates. Mmm, a case statement might be useful here...
John,
Here are versions that work with 3.6 latest. I hadn't noticed that some things changed in VMMaker itself. This also makes the Interpreter changes work right.
Cheers, - Andreas
PS. What kind of variable name is "globalStructureBuildMethodHasFoo"??? ;-)
-----Original Message----- From: squeak-dev-bounces@lists.squeakfoundation.org [mailto:squeak-dev-bounces@lists.squeakfoundation.org] On Behalf Of John M McIntosh Sent: Monday, July 07, 2003 11:16 AM To: The general-purpose Squeak developers list Subject: [ENH][VM] Improved code generation (hopefully ;)
From: "Andreas Raab" <andreas.raab@g...> ... I set up a little benchmark (which is attached) that runs the same bytecodes with some re-ordering (applied to maximize/minimize the effect of the dispatch) and ran it. Here are the results:
iterations  well-predicted (msecs)  mispredicted (msecs)
10          0                       0
100         0                       0
1000        0                       1
10000       3                       7
100000      24                      68
1000000     247                     673
10000000    2463                    6722
The interesting fact of the matter is that we spend about 2.7 times as much time when we have branch mispredictions as without them. I'd be delighted if someone could review the benchmark itself in order to ensure that I haven't done anything wrong and that the results themselves are valid.
My numbers using Mac VM 3.5.2b1 (500 MHz G3) are:

10        0     0
100       0     0
1000      1     1
10000     8     7
100000    77    78
1000000   797   781
10000000  7817  7861

and again:

10        0     0
100       0     0
1000      1     1
10000     15    8
100000    78    77
1000000   786   776
10000000  7809  7819
Perhaps someone else can confirm these?
Hi John,
My numbers using Mac VM 3.5.2b1 (500 MHz G3) are:

10        0     0
100       0     0
1000      1     1
10000     8     7
100000    77    78
1000000   797   781
10000000  7817  7861
Just curious: Do you "gnuify" the VM before compiling it[*]? The above looks as if you might still be using the switch-based dispatch (in which case there should be no difference between the two versions as the branch prediction will _always_ be wrong no matter how you arrange the bytecodes). If you don't gnuify it, I'd recommend doing so - like I said it bought me a factor of two in speed.
[*] The important part of gnuification here is to replace the switch/break statements with labeled gotos which are supported by GCC but not other compilers. The gnuify script also allows explicit assignment of registers for certain variables (like localIP/localSP/currentBytecode) and I am almost certain that fiddling around with this should be helpful for PPC too.
Cheers, - Andreas
Just curious: Do you "gnuify" the VM before compiling it[*]? The above looks as if you might still be using the switch-based dispatch (in which case there should be no difference between the two versions as the branch prediction will _always_ be wrong no matter how you arrange the bytecodes).
Ooops. OK, here are the numbers from the CodeWarrior version (OS 9) for 3.5.2b1. It's not gnuified, and the declaration reads like so:

int interpret(void) {
    register struct foo *foo = &fum;
    int localHomeContext;
    char *localSP;
    char *localIP;
    int currentBytecode;
    int localReturnContext;
    int localReturnValue;

so everything is left up to the CodeWarrior compiler to decide.
10        0     0
100       0     0
1000      1     1
10000     11    10
100000    104   109
1000000   1032  1025
10000000  10282 10207

10        0     0
100       0     1
1000      1     1
10000     11    9
100000    111   100
1000000   1025  1020
10000000  10261 10198
[*] The important part of gnuification here is to replace the switch/break statements with labeled gotos, which are supported by GCC but not other compilers.
Ah, well, here is a thought that might send you away thinking: the PC folks who complain about the G5 SPEC marks say the GCC compiler creates crummy code on Intel versus, say, some other compiler... so any comparison using GCC on Intel is flawed...
Although labeled gotos are a GCC feature, we forget that one can take a label's address at runtime.
So if the code looks like below (which comes from a note I sent to CodeWarrior technical support back in Jan 2001), then I think you can calculate the label addresses at runtime, or based on a one-time if statement for the primitive switch statement, and then you get to invoke the goto. The reason for the note to tech support is that doing this in CodeWarrior pre-8.x causes an internal compiler failure. I haven't got CodeWarrior 8.x, so I've not been able to confirm the bug was fixed.
PS If anyone has CodeWarrior 8.x on the Mac and would like to attempt a compile, please let me know and I'll send you a 3.0.22 Mac VM folder to fiddle with.
However, you might want to try this with Intel's or Microsoft's compiler and see how the code compares to GCC.
Our GNU friends have a modification to the bytecode loop, which is a 256-case statement, such that:
    jumptable[0] = &&_0;
    ...
    jumptable[255] = &&_255;

    currentByteCode = byteAt(++localip);
    while (true) {
        switch (currentByteCode) {
        case 0: _0:
            /* stuff */
            currentByteCode = byteAt(++localip);
            goto *jumptable[currentByteCode];
        case 1: _1:
            ...
        }
    }
This has been logged in our database as bug number WB1-29763.
Note that the gnuified code macros CASE(0) and BREAK; hide the actual "case 0: _0:" labels and the "goto *jumptable[currentByteCode]" dispatch. These are made explicit here to be helpful to the Metrowerks staff.
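For readers who haven't seen the technique, here is a minimal, self-contained sketch of this threaded-dispatch style that compiles under GCC/Clang (which support the "labels as values" extension). It uses a toy three-opcode machine with hypothetical opcodes, not the real Squeak bytecode set:

    #include <stdio.h>

    /* Toy opcodes (hypothetical): 0 = push 1, 1 = add top two, 2 = halt. */
    long run(const unsigned char *code) {
        /* &&label takes a label's address; valid only inside this function. */
        static void *jumptable[] = { &&op_push1, &&op_add, &&op_halt };
        long stack[16];
        long *sp = stack;
        const unsigned char *ip = code;

        goto *jumptable[*ip];        /* dispatch the first bytecode */

    op_push1:
        *sp++ = 1;
        goto *jumptable[*++ip];      /* fetch next bytecode, jump directly */
    op_add:
        sp[-2] = sp[-2] + sp[-1];
        --sp;
        goto *jumptable[*++ip];
    op_halt:
        return sp[-1];
    }

    int main(void) {
        unsigned char prog[] = { 0, 0, 1, 2 };  /* push1, push1, add, halt */
        printf("%ld\n", run(prog));             /* prints 2 */
        return 0;
    }

The point of the direct goto at the end of each opcode (rather than falling back to the top of a switch) is that each opcode gets its own dispatch branch, which modern branch predictors handle far better than one shared indirect jump.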
Ah, well, here is a thought that might send you away to think, because the PC folks who complain about the G5 SPEC marks say the GCC compiler creates crummy code on Intel versus, say, some other compiler... so any use of GCC on Intel is flawed...
Well, it depends on what you're doing. I found that for "real-life use" in Squeak, GCC beats any other compiler (assuming you supply proper gnuification, of course). Given that I only care about the bottom line, I don't care very much about what people say as long as I can prove them wrong ;-)
After all, if they've got a bunch of primitives that run significantly faster using a different compiler, they're absolutely free to use that compiler for them. But for the main interpreter loop I have yet to find anything that gets even close to GCC (as measured by both tinyBenchmarks and macroBenchmarks). MSVC is pretty good (Borland C, or whatever it's called today, used to be a total mess) but loses due to the lack of threaded dispatch, and buying the Intel C compiler seems like a waste of money to me, all pros and cons considered (if someone out there has it, I'd be interested in seeing if there are any improvements).
Cheers, - Andreas
John, one question about your results: How do they measure up against the results coming from #tinyBenchmarks? I don't think there are very many bytecodes involved there either, so (in theory) the results obtained should be exactly in line.
Cheers,
- Andreas
Hi, I'm not sure what you are asking for here?
I see someone talked about the branch prediction logic in PowerPC versus Intel. I'll note that Apple once had a tool to instrument a program, then run it and collect branch statistics. This data was then reapplied to the binary to set tag bits on branches, because you can indicate which way a branch should usually go as part of the instruction (a hint).
For the 604e I found this made a measurable improvement for the Squeak VM. However, for the G3 the gain was far less because the branch-predictor logic was so much better. I've not gone back to revisit that; besides, it's another time-consuming step in the build process. However, I note some discussion of this issue for the G5, so we'll see.
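As an aside, the closest source-level analogue of those static branch-hint bits today is GCC's __builtin_expect, which lets you tell the compiler which way a branch usually goes. A hedged sketch (the function and data here are hypothetical, not from the Squeak VM):

    #include <stdio.h>

    /* Wrappers in the style commonly used with __builtin_expect. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int sum_abs(const int *vals, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (likely(vals[i] >= 0))   /* hint: negative values are rare */
                sum += vals[i];
            else
                sum -= vals[i];
        }
        return sum;
    }

    int main(void) {
        int vals[] = { 1, 2, -3, 4 };
        printf("%d\n", sum_abs(vals, 4));  /* 1 + 2 + 3 + 4 = 10 */
        return 0;
    }

The hint lets the compiler lay out the likely path as the fall-through case, much as the instrumented tag bits did for the 604e, though how much it helps depends on how good the CPU's dynamic predictor already is.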
John, one question about your results: How do they measure up against the results coming from #tinyBenchmarks? I don't think there are very many bytecodes involved there either, so (in theory) the results obtained should be exactly in line.
Hi, I'm not sure what you are asking for here?
This:
How does this compare to the overall bytecode speed we measure through, e.g., something like #tinyBenchmarks? The benchmark contains 45 bytecodes in the loop (hand-counted; again, someone please double-check and correct me), which means we have 450,000,000 bytecodes in 2.463 secs (well predicted) vs. 6.722 secs (mispredicted), resulting in:
    (450000000 / 2.463) truncated asStringWithCommas => 182,704,019 bytecodes/sec
    (450000000 / 6.722) truncated asStringWithCommas => 66,944,361 bytecodes/sec
Comparing this to the measures obtained by #tinyBenchmarks on the same machine:
117,323,556 bytecodes/sec 3,377,222 sends/sec
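For anyone without an image handy, the same division can be checked mechanically; this trivial C rendering reproduces the two figures above (truncating, as the Smalltalk expressions do):

    #include <stdio.h>

    int main(void) {
        /* 450,000,000 bytecodes over the two measured wall times. */
        double bytecodes = 450000000.0;
        printf("%ld bytecodes/sec (well predicted)\n",
               (long)(bytecodes / 2.463));   /* 182704019 */
        printf("%ld bytecodes/sec (mispredicted)\n",
               (long)(bytecodes / 6.722));   /* 66944361 */
        return 0;
    }
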
Cheers, - Andreas
On Monday 07 July 2003 11:40 am, Bilal Ahmed wrote:
Has anyone looked at voice recognition in Squeak? I'd appreciate any direction. Thanks!
You might start by looking at the class category "Speech-Phoneme Recognizer".
I don't know anything about the status of this work, though.
Perhaps John McIntosh could comment on it.
-- Ned Konz http://bike-nomad.com GPG key ID: BEEA7EFE
Ah, that would be John Maloney (jm), not me (jmm).