Hi,
I found a reference on the swiki to Hans-Martin Mosner's extra bit for tagging object pointers.
http://www.heeg.de/~hmm/squeak/2tagbits/
This must have been discussed in the past. I wish to renew that discussion as I think it is an exceptionally good idea. Perhaps H-M M himself will give his current position on this. The following is what I make of it.
Description:
The details can be elaborated in different ways; here's one (showing the two least significant bits):
10 - small integer
00 - pointer
01 - special i
11 - special ii
Small integers work as before, but with one bit less. This will probably cause some solvable problems.
The interesting part is special i/ii; I propose that they are used as follows:
byte3 byte2 byte1 - together form 24 bits, giving 16M values.
byte0 - bit 0 is constant = 1; the 7 bits left give 128 values used as tags.
So we have 128 tags; used wisely, that opens up a huge range of possibilities.
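The layout described above can be sketched in C. This is only an illustration under assumptions: the names `oop_t`, `special_make`, `special_tag` and `special_value` are hypothetical, and a 32-bit word is assumed, with the 24-bit value in bytes 3..1 and the 7-bit tag in bits 7..1 of byte 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the layout follows the proposal:
   bits 31..8 = 24-bit value, bits 7..1 = 7-bit tag, bit 0 = 1. */
typedef uint32_t oop_t;

enum { TAG_BITS = 7, VALUE_BITS = 24 };

/* Pack a 7-bit tag and a 24-bit value into one tagged word. */
static inline oop_t special_make(uint32_t tag, uint32_t value)
{
    assert(tag < (1u << TAG_BITS));
    assert(value < (1u << VALUE_BITS));
    return (value << 8) | (tag << 1) | 1u;
}

/* Bit 0 set marks "special i" (01) and "special ii" (11) alike. */
static inline int is_special(oop_t oop) { return (oop & 1u) != 0; }

/* Extract the 7-bit tag from byte 0. */
static inline uint32_t special_tag(oop_t oop) { return (oop >> 1) & 0x7Fu; }

/* Extract the 24-bit value from bytes 3..1. */
static inline uint32_t special_value(oop_t oop) { return oop >> 8; }
```

Note that in this reading the low bit of the 7-bit tag is what separates "special i" from "special ii", so the two patterns together give the 128 tags.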
Here are some possibilities:
Characters: ascii/lf, ascii/cr, ascii/crlf, iso-xxxx-1, utf-8, utf-16, .., home-brew-1, .. Consider for example one tag meaning the present character set (ascii/cr) with extra info for font, style, size, color.
Standard classes - a well-chosen set of essential classes, can be easily accessed and communicated. This should include Object, Symbol and many such, and also the ParseNode hierarchy and similar.
Ansi (and other well-established) protocols - all standard interfaces can be cataloged in this form and easily communicated
bytecodes - the normal bytecode set can be considered numeric code/symbol at the same time
widget family
primitive methods
special (simple) methods (projections, many others)
a nomenclature for (C-like) types
tightly packed structures
html tags
Prolog-like variables and other things with special "roles in the system"
other important (closed) coding systems: midi, vrml
many other possibilities exist
Mosner gives an example that in this version would give 12-bit coordinates for Point.
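As a rough sketch of what 12-bit coordinates could look like in the 24-bit value field, assuming the same 32-bit layout as described above: the tag number `POINT_TAG` and all function names are made up for illustration, and the coordinates are assumed to be unsigned (0..4095), which may not match Mosner's original code.

```c
#include <stdint.h>

typedef uint32_t oop_t;

/* Hypothetical tag number reserved for immediate points. */
#define POINT_TAG 3u

/* Pack two unsigned 12-bit coordinates into the 24-bit value field.
   Coordinates outside 0..4095 are truncated to their low 12 bits. */
static inline oop_t point_make(uint32_t x, uint32_t y)
{
    uint32_t value = ((x & 0xFFFu) << 12) | (y & 0xFFFu);
    return (value << 8) | (POINT_TAG << 1) | 1u;
}

/* x lives in bits 31..20, y in bits 19..8. */
static inline uint32_t point_x(oop_t p) { return (p >> 20) & 0xFFFu; }
static inline uint32_t point_y(oop_t p) { return (p >> 8) & 0xFFFu; }

/* An immediate cannot be modified in place; "changing" y
   means building a new tagged word. */
static inline oop_t point_with_y(oop_t p, uint32_t y)
{
    return point_make(point_x(p), y);
}
```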
These can be considered universal (cross-image) pointers: things that are lifted above gc (global tenure). It is a good help in communicating with plugins, providing a rich language independent of gc.
Some of the above I have experience with. No doubt others will find interesting uses of the idea if it becomes available.
The above is a bit cryptic and very incomplete; the bottom line:
This is a good thing, let's reimplement it. If Hans-Martin Mosner will do it, good; if not, I will. If it is wanted, that is. An immediate gain is for different character sets, including large ones: 24 bits (16M values) is no problem, whereas the current way of handling characters doesn't scale.
Note also that this idea will be even better on 64-bit machines, which will appear sooner or later. One would then have an immense set of interesting values liberated from the burden of gc.
/Mats
Cool!
I had been wondering for a while if the extended tag scheme would work. Any VM implementers/object memory specialists want to weigh in?
Cheers, Bijan Parsia.
Bijan Parsia bparsia@email.unc.edu is widely believed to have written:
> Cool!
> I had been wondering for a while if the extended tag scheme would work. Any VM implementers/object memory specialists want to weigh in?
VW has used two tag bits for years. IIRC it uses three of the patterns:
smallinteger
oop
character
The cost is not just the reduction in range of SmallInteger - after all, the difference is pretty small - but the complexity of the tag checking. At the moment, checking for SmallInteger/oop is a single test. Two tag bits make deriving the class more complex (check for SI, check for the other option, remainder is a normal object) and affect any code that needs to understand the class of the objects involved.
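To make the extra checking concrete, here is a small sketch comparing the two cases. It assumes the current one-bit scheme marks SmallIntegers with bit 0 set, and for the two-bit case it uses the encoding proposed earlier in this thread (10 small integer, 00 pointer, 01/11 special). The names `kind_t`, `kind_one_bit` and `kind_two_bits` are hypothetical.

```c
#include <stdint.h>

typedef uint32_t oop_t;

typedef enum { KIND_POINTER, KIND_SMALL_INTEGER, KIND_SPECIAL } kind_t;

/* One tag bit: a single test separates SmallInteger from oop. */
static inline kind_t kind_one_bit(oop_t oop)
{
    return (oop & 1u) ? KIND_SMALL_INTEGER : KIND_POINTER;
}

/* Two tag bits (10 small integer, 00 pointer, 01 and 11 special):
   check for SI, check for the other option, remainder is a pointer. */
static inline kind_t kind_two_bits(oop_t oop)
{
    if ((oop & 3u) == 2u)
        return KIND_SMALL_INTEGER;
    if (oop & 1u)
        return KIND_SPECIAL;
    return KIND_POINTER;
}
```

Every place in the VM that derives a class from an oop would need the second form, which is where the runtime cost would come from.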
A restriction is that this is only really useful for manifest constant objects; yes, you could use the data bits to index a list of classes for example, but how is that an advantage over having the oop of the class? Immediate Points, restricted range floats, colour values, anything where the bits are the data, would all be plausible. Remember that such manifest objects cannot be altered any more than a SmallInteger can, so quite a bit of code in the image would be affected; for example you would have to write pt1 := pt1 x @ (pt1 y * 2) instead of pt1 y: (pt1 y * 2) ... which is a poor example since I think I would prefer it anyway, but you're smart enough to see what I mean.
So, yes, it might be useful in some sense, but I rather suspect the runtime costs are unpleasant, especially once you go past a simple VW-like form.
tim
This is very interesting!
Has this been tried in other implementations before?
Does this essentially mean that up to 128 immediate classes can be added, each with up to 16M unique instances?
Since Squeak uses direct pointers and addresses are typically aligned on 4 byte boundaries, it seems like that bit is being wasted anyway, except for the SmallInteger case. It doesn't seem like it should be that big of an impact on performance either (but maybe so, have any benchmarks been run?).
With regard to things such as Unicode (and other character sets), if you tried to create a string out of oops (in some sort of collection for instance), you would be using up twice as much space as the character set (in the case of Unicode anyway) requires. But, alternatively, you could have a variable word class that efficiently stores the Unicode (i.e. UnicodeString)...when accessing a single character in the string, it would be a simple process to tack on the extra two (well-known) bytes that would form the immediate OOP for the corresponding Unicode character.
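A minimal sketch of that idea, assuming 16-bit code units and the 24-bit-value/7-bit-tag layout proposed in the original message. Everything named here (`unicode_string_t`, `unicode_at`, `unicode_code`, the tag number `UNICODE_TAG`) is hypothetical, and indexing is 0-based for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t oop_t;

/* Hypothetical tag number reserved for Unicode characters. */
#define UNICODE_TAG 1u

/* Compact storage: two bytes per character instead of a 4-byte oop. */
typedef struct {
    size_t size;
    const uint16_t *units;
} unicode_string_t;

/* Build the immediate character oop on access by adding
   the well-known tag bits to the stored code unit. */
static inline oop_t unicode_at(const unicode_string_t *s, size_t index)
{
    assert(index < s->size);
    return ((oop_t)s->units[index] << 8) | (UNICODE_TAG << 1) | 1u;
}

/* Recover the code unit when storing a character back into a string. */
static inline uint16_t unicode_code(oop_t ch)
{
    return (uint16_t)(ch >> 8);
}
```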
The ability to efficiently represent just about any encoding scheme is reason enough in my opinion.
I think it's definitely worth exploring! You have my vote.
- Stephen
-----Original Message-----
From: Mats Nygren [mailto:nygren@sics.se]
Sent: Friday, September 01, 2000 7:35 AM
To: squeak@cs.uiuc.edu
Cc: nygren@sics.se; hm.mosner@cww.de; hmm@heeg.de
Subject: The Mosner bit
[original message snipped]
Hi Mats & others, the idea is actually not mine; I copied it almost verbatim from ParcPlace Smalltalk (now Cincom VisualWorks). VW has the following combinations:
00 object pointer
01 character
10 forbidden
11 small integer
This has the big advantage that the tag bits have individual semantics: bit 0 is the "immediate" bit, and bit 1 is the "small integer" bit. So to check whether something is a SmallInt, you just have to check one bit. In a similar vein, to check whether an oop is a legal pointer into memory and can be dereferenced, you also have to check just one bit. Since both of these operations are fairly common in the VM, they should be as cheap as possible, and on most architectures multi-bit tests are a bit more expensive than single-bit tests.
In Squeak, there is the additional complication that the 10 tag bit combination is used by the GC machinery to detect object headers when walking through pointer objects. It first writes the 10 tag combination into the last header word and then scans from the end of each pointer object. When it finds that tag, it knows that it has reached the beginning of the object. The tag can then be reconstructed from other header information.
What my approach does differently from ParcPlace's is that 01 is just the tag for immediate non-integer objects. What you put into that 30-bit space is pretty much your business, but there are some constraints:
* It had better be immutable data...
* In an object table implementation, these objects cannot participate in a become: operation. For Squeak with its direct pointers it does not matter.
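The single-bit tests that fall out of the 00/01/10/11 assignment above can be sketched as follows. This is an illustration only, not the actual VW or Squeak code: the function names are hypothetical, a 32-bit word is assumed, and the small-integer test relies on the 10 pattern never appearing in a live oop.

```c
#include <stdint.h>

typedef uint32_t oop_t;

/* Bit 0 is the "immediate" bit: clear means a dereferenceable pointer. */
static inline int is_pointer(oop_t oop)   { return (oop & 1u) == 0; }
static inline int is_immediate(oop_t oop) { return (oop & 1u) != 0; }

/* Bit 1 is the "small integer" bit. Testing it alone is enough
   only because the 10 combination is forbidden for live oops. */
static inline int is_small_integer(oop_t oop) { return (oop & 2u) != 0; }

/* Tag a non-negative 30-bit magnitude as a small integer (11). */
static inline oop_t small_integer_make(uint32_t magnitude)
{
    return (magnitude << 2) | 3u;
}

/* Tag a 30-bit payload as an immediate non-integer object (01). */
static inline oop_t immediate_make(uint32_t payload)
{
    return (payload << 2) | 1u;
}
```

Each predicate compiles to one mask-and-branch, whereas an encoding without per-bit meaning needs a two-bit mask and compare.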
Since I'm currently focusing on getting SCAN into gear, I would like it if somebody else could take over work on the 2-bit front. The old code is there to pilfer, and I would certainly engage in discussion about what should be implemented and what not. Specifically, the networking aspects of Squeak need to be considered; the situation there is radically different from when I first implemented the stuff. Perhaps the 3D graphics guys should have a word, too.
Cheers, Hans-Martin
squeak-dev@lists.squeakfoundation.org