Hi--
This is another call for feedback on the design of Naiad[1], a Smalltalk module system I'm writing for Squeak as part of the Spoon project[2].
On the theory that I'll get more of a response by including the whole text rather than a link to it, here it is... :)
***
2008-10-20, 1946 GMT
Copyright (c) 2008 Craig Latta. All rights reserved.
Hi--
I've been on a quest to make Squeak smaller and more modular, the Spoon project[2]. Part one was making the object memory small. Part three is about making the virtual machine small. This message is about part two, making a module system suitable for adding new behavior to a minimal system in an organized way, and for transferring behavior accurately between running systems.
Spoon's module system is called "Naiad", which is an acronym for "Name And Identity Are Distinct". It keeps track of the development history of a system (what the "sources" and "changes" files are for now), and makes it available for exchange with other systems. I think keeping classes' names and identities separate is critical for this. Following are some notes on its design and use, including the object model[1].
At this point I'd like to emphasize I am the author of this design, that I intend to release its implementation under an MIT-style license, and that I'd like to pursue a graduate degree with it (I'm open to invitations :).
***
motivation
A traditional Smalltalk system uses source code to express both development history and changes exchanged between systems. The precise meaning of source code depends on the current state of the system compiling it. Since a Smalltalk system is dynamic, source code is an inherently ambiguous medium across time.
The most problematic system artifacts in light of this ambiguity are classes. All activity in a Smalltalk system is the result of sending messages to objects. The sending of a message invokes the execution of a method, a sequence of instructions for a virtual processor. Some of these instructions manipulate the state of the object receiving the message. Classes define the structure of that state. Therefore, when those class definitions change, the source code for the methods of those classes may become meaningless.
One may confront this situation when trying to recompile source code for an old version of a method whose class definition has changed in the meantime. Similarly, source code from one system may not be meaningful on another, since corresponding class definitions on each system may change independently (or be removed entirely).
This means that the accurate exchange of behavior requires manual labor, hindering the propagation of useful fixes and new code. It also means that interpretation and use of historical code is more difficult than necessary. So we pay twice for this problem: when learning the system, and when trying to share our work with others. By separating class name from identity, Naiad makes Smalltalk more approachable for newcomers, and more productive for developer and user communities.
editions
Using Naiad, each development system consists of two object memories: one containing developed code, and another containing "editions" which describe that code. I'll call the first one the "subject memory" and the other the "history memory".
An Edition is a description of some artifact in the subject memory at some point in time, currently an author, comment, tag, class, method, module, checkpoint, or edit. Each edition has a reference to that artifact's next state in the future (the next edition) and in the past (the previous edition), as well as an author edition, a collection of licenses, and a timestamp.
An Edit represents the activation of some edition at a point in time. For example, there may be a method created in 2005 that is removed in 2006 and reactivated in 2007. There would be an Edit for each of those three events, but only two method editions (one representing the method becoming active, and one representing it being removed).
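To make the difference between editions and edits concrete, here is a minimal sketch of the 2005/2006/2007 example. It is written in Python rather than Smalltalk purely for illustration; the class and field names are assumptions, not Naiad's actual object model, and using a nil source to mark a removal is one possible encoding.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins only; these are not Naiad's real classes.
@dataclass(eq=False)
class MethodEdition:
    """One state of a method; source=None marks a removal (assumed encoding)."""
    selector: str
    source: Optional[str]
    previous: Optional["MethodEdition"] = None
    next: Optional["MethodEdition"] = None

@dataclass(eq=False)
class Edit:
    """Records that some edition became active at a point in time."""
    year: int
    activated: MethodEdition

# Two editions: the method as written, and its removal.
created = MethodEdition("printOn:", "printOn: aStream ...")
removed = MethodEdition("printOn:", None, previous=created)
created.next = removed

# Three edits: create (2005), remove (2006), reactivate the 2005 edition (2007).
edits = [Edit(2005, created), Edit(2006, removed), Edit(2007, created)]

distinct_editions = {id(e.activated) for e in edits}
print(len(edits), "edits,", len(distinct_editions), "editions")  # 3 edits, 2 editions
```

Reactivating the method in 2007 creates a new Edit but reuses the existing 2005 edition, which is why the edit count and the edition count differ.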
The history memory replaces the current changes and sources files. It has an instance of EditHistory corresponding to the subject memory, which records the active (current) editions for the classes, methods, modules, and authors in the subject memory. It also keeps the subject memory's id and the last Edit made to the subject memory.
Every time the subject memory adds, changes, or removes a class definition, method, author, comment, tag, or module, or makes a checkpoint (i.e., makes an edit), it adds the appropriate editions to the history memory via remote messages. The history memory snapshots itself after every edit, so as to provide crash recovery support.
The subject memory keeps a remote reference to the history memory's instance of EditHistory as a class variable of the local EditHistory class, and interacts with it using utility messages sent to the local EditHistory class. The history memory also keeps that EditHistory instance as a class variable of its local EditHistory class, but as a local reference.
An edition typically elides some of its references when it is transferred out of a history memory. For example, a transferred edition will usually omit the references to its next and previous editions. The requesting subject memory can calculate the ID of those editions and obtain them with a separate request, if necessary.
A subject memory may elect to keep its EditHistory instance as a local object, such as in a situation where one wants some limited immutable history for debugging purposes, and no crash recovery support. Whether in this scenario or in normal development, the same EditHistory utility messages suffice, since no special code need be written to support remote objects. If no edits will be made during deployment, and no history retrieval is required, one may simply jettison the history memory. One may always reconnect the subject and history memories at a later time and continue development.
The subject memory has tools for browsing and activating the editions, wherever they are located. This means that no special tools are needed to browse the artifacts of multiple subject systems; one uses the same tools as for browsing the artifacts of the local subject memory. Each subject memory may connect to multiple history memories concurrently (if allowed).
For that matter, the history memories of multiple systems may connect to each other directly, to aggregate editions from multiple people, for example.
class and method IDs
Each class in the subject memory has a universally-unique identifier[3], or UUID. The classes in the minimal subject memory are assigned UUIDs before the initial release, and all subsequent classes are assigned UUIDs when created. Rather than use the single word "class" to refer to either a metaclass or to its sole instance, Spoon introduces the term "protoclass". For example, (Array class) is a metaclass, and its sole instance, Array, is a protoclass. Each metaclass and protoclass has its own UUID, called a "base ID". This is supported by a new instance variable in ClassDescription.
Each version of each class is identified by a ClassID, a byte array with segments for the class's base ID, author UUID, and a sixteen-bit version. This means we can uniquely identify, for each author, 65,535 versions of each class in the system. Since we identify authors by UUID, the number of possible authors is very large.
Each version of each method is identified by a MethodID, a byte array which contains a ClassID and segments for the method's selector, author UUID, and a sixteen-bit version. This means we can uniquely identify, for each author, 65,535 versions of each method in each version of each class in the system.
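The following sketch shows one way such byte arrays could be laid out. The text does not specify segment order, byte order, or how the selector is encoded, so the big-endian version field, the length-prefixed UTF-8 selector, and the function names are all assumptions made for illustration (in Python, not Smalltalk).

```python
import struct
import uuid

# Assumed layout, not Naiad's actual wire format.
def class_id(base_id: uuid.UUID, author: uuid.UUID, version: int) -> bytes:
    """Pack base ID (16 bytes) + author UUID (16 bytes) + 16-bit version."""
    return base_id.bytes + author.bytes + struct.pack(">H", version)

def method_id(cid: bytes, selector: str, author: uuid.UUID, version: int) -> bytes:
    """Pack a ClassID, a length-prefixed selector, an author UUID, and a 16-bit version."""
    sel = selector.encode("utf-8")
    return cid + struct.pack(">H", len(sel)) + sel + author.bytes + struct.pack(">H", version)

array_base = uuid.uuid4()   # stands in for a protoclass's base ID
author = uuid.uuid4()       # stands in for an author UUID
cid = class_id(array_base, author, 1)
mid = method_id(cid, "at:put:", author, 3)
print(len(cid), len(mid))   # 34 61
```

A sixteen-bit field holds the values 0 through 65535; the figure of 65,535 versions would follow if one value (say 0) is reserved, though that is a guess.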
method editions and method literal markers
Each MethodEdition holds a reference to the corresponding ClassEdition, the method source code, and the information needed to reconstruct the corresponding CompiledMethod directly, without need of the compiler (the method header, initial and final program-counter values, method literal markers, and instructions). If one will never use the history memory to install methods in a subject memory that lacks a compiler, one could drop the compiled method information to save space.
Method literal markers are used to transmit a compiled method's literal frame values between object memories. There are method literal marker classes to support references to classes, class variables, other pool variables, and literal objects, and to support methods which perform class-side super-sends. Each method literal marker instance knows how to serialize itself as part of Spoon's remote messaging system. In particular, when a method literal marker that refers to a class transmits itself, it transmits the ClassID of that class, not the name of the class.
This gets at the namesake concept of Naiad, "Name And Identity Are Distinct". When referring to a class, we never need to use its name. Each version of each class is an object with a distinct identity. By using ClassIDs to refer to each of them, we can avoid using class names at all when storing history or distributing code. This means that the name of each class can be anything, as far as the system is concerned.
With every class name unconstrained, there is no need for "namespaces" to distinguish between classes which happen to have the same name at some point in time. Each class effectively has its own namespace, since it is uniquely identifiable regardless of its name.
Developer tools armed with this information can resolve ambiguity for humans browsing and changing the system. If a developer writes a method which uses a name shared by multiple classes, the system can present more information about each of those classes (such as the author, time of creation, version, and module association), so that the developer can choose the intended one. When browsing such a method, the system can distinguish the aliased class name visually, indicating that there is disambiguating information available.
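As a rough illustration of how a tool might resolve a shared name, the sketch below keys classes by ID and treats the name as just another attribute. It is in Python for brevity; `Registry`, `ClassInfo`, and the sample classes and authors are hypothetical and do not describe Naiad's tools.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-ins for illustration only.
@dataclass(frozen=True)
class ClassInfo:
    class_id: bytes   # stands in for a Naiad ClassID
    name: str         # free-form; may collide with other classes
    author: str
    module: str

class Registry:
    def __init__(self) -> None:
        self._by_id: Dict[bytes, ClassInfo] = {}

    def add(self, info: ClassInfo) -> None:
        self._by_id[info.class_id] = info

    def by_id(self, class_id: bytes) -> ClassInfo:
        """Identity lookup: always unambiguous."""
        return self._by_id[class_id]

    def candidates(self, name: str) -> List[ClassInfo]:
        """Name lookup: may return several classes for the developer to choose from."""
        return [c for c in self._by_id.values() if c.name == name]

registry = Registry()
registry.add(ClassInfo(b"\x01", "Parser", "alice", "XML"))
registry.add(ClassInfo(b"\x02", "Parser", "bob", "CSV"))
registry.add(ClassInfo(b"\x03", "Stream", "alice", "Kernel"))

matches = registry.candidates("Parser")
for c in matches:
    print(c.name, "by", c.author, "in module", c.module)
```

Only the human-facing lookup by name can be ambiguous; anything stored or transmitted would carry the ID.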
class editions and shared variables
Each ClassEdition holds the editions for all the method versions currently active in the corresponding class in the subject memory. Since every edition keeps a reference to its previous and next editions, one can trace the history of any method by starting at the active edition. Removed methods are represented by method editions which have the same MethodID as a normal previous method edition, but with the rest of the fields set to nil.
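Tracing a method's history from its active edition could look something like the sketch below (Python, for illustration). The `MethodEdition` class here is a simplified assumption: the ID is a placeholder string, and a removal is modeled as an edition whose source is nil while its ID matches the previous edition, as described above.

```python
from typing import Iterator, Optional

# Simplified, assumed model; not Naiad's actual MethodEdition.
class MethodEdition:
    def __init__(self, method_id: str, source: Optional[str],
                 previous: Optional["MethodEdition"] = None) -> None:
        self.method_id = method_id
        self.source = source
        self.previous = previous
        self.next: Optional["MethodEdition"] = None
        if previous is not None:
            previous.next = self   # keep the chain linked in both directions

    @property
    def is_removal(self) -> bool:
        return self.source is None

def history(active: MethodEdition) -> Iterator[MethodEdition]:
    """Walk from the active edition back through all previous editions."""
    edition: Optional[MethodEdition] = active
    while edition is not None:
        yield edition
        edition = edition.previous

v1 = MethodEdition("Array>>first/v1", "first ^self at: 1")
gone = MethodEdition("Array>>first/v1", None, previous=v1)

trail = ["removed" if e.is_removal else e.source for e in history(gone)]
print(trail)  # ['removed', 'first ^self at: 1']
```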
Each ClassEdition also holds the information needed to reconstruct the corresponding class directly, without need of the class builder. For all classes, this includes the format, instance variable names, and superclass ID. For protoclasses, it also includes the class pool keys, class name, and received pool IDs.
In Spoon, every shared variable pool is the responsibility of some class in the system. There is no global variables pool ("system dictionary"). Each class that defines a pool is said to "publish" that pool; classes which use that pool "receive" it. Spoon adds an instance variable to Class to map published pools to their names. Each ReceivedPoolID that a protoclass edition uses is a byte array which contains a class ID and a published pool name.
checkpoints and modules
A Checkpoint edition is simply a named marker of a particular point in time. A developer may use checkpoints to indicate various interesting states of development, and use the tools to regress or replay edits made before or after that time.
The largest unit of work is represented by module editions. They are named collections of method IDs, indicating the specific versions of methods which comprise a module, along with sets of child, parent, prerequisite, and postrequisite module editions. When a module edition is transferred out of a history memory, those edition references are transmitted as ModuleIDs. Each module edition also has an "antimodule", a module edition calculated at installation time by a receiving system which, if applied, would undo the changes made by installing the original module. Finally, each module edition has a URI by which someone at a remote site may install the module.
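One way to picture an antimodule is as the record of whatever each installed method displaced. The sketch below is an assumption-laden illustration in Python: a system is modeled as a plain mapping from method names to placeholder version strings, and a module as a mapping of the versions to install (with `None` meaning "remove"). Naiad's real modules hold method IDs and module relationships that this ignores.

```python
from typing import Dict, Optional

# Placeholder types for illustration; not Naiad's data model.
System = Dict[str, str]             # method name -> installed version
Module = Dict[str, Optional[str]]   # method name -> version to install, or None to remove

def install(system: System, module: Module) -> Module:
    """Apply module to system in place; return the antimodule that undoes it."""
    anti: Module = {}
    for name, new_version in module.items():
        anti[name] = system.get(name)   # prior version, or None if the method was absent
        if new_version is None:
            system.pop(name, None)
        else:
            system[name] = new_version
    return anti

system = {"Foo>>bar": "bar/v1", "Foo>>baz": "baz/v1"}
before = dict(system)
module = {"Foo>>bar": "bar/v2", "Foo>>qux": "qux/v1", "Foo>>baz": None}

antimodule = install(system, module)   # computed against this system's state
install(system, antimodule)            # applying it restores the original state
print(system == before)  # True
```

Because the antimodule depends on what the receiving system held before installation, it has to be calculated at installation time by the receiver, as the text says.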
That URI represents a command to a Spoon system running on a requestor's local machine; it refers to a standard port on localhost. Its path is a text-encoded action, containing an instruction (in this case "install a module"), the hostname and port of a Spoon system providing the module, and the module's ID. The receiving system uses this information to request the module from a providing history memory, which then transmits editions as necessary. Exactly which editions are transmitted depends on the state of the receiving system; this is a two-way conversation between the providing and receiving systems. This is often more time and space efficient than simply providing all of a module's code, which is what happens with traditional static representations like change sets.
The URIs may be cited on ordinary webpages, which are indexed by search engines like Google. A person in search of a module for a particular purpose can search for it with a web browser, using those search engines. Having found a module's URI, the person can click on it, establishing a connection to an embedded webserver in their local Spoon system, which carries out the URI's command.
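The text describes what such a URI carries but not its exact syntax, so the following is only a guess at a shape it could take, sketched in Python. The `http` scheme, port 8080, the slash-separated path layout, the hex encoding of the module ID, and the host `modules.example.org` are all hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit

LOCAL_PORT = 8080  # hypothetical; the text only says "a standard port on localhost"

def make_uri(instruction: str, provider_host: str, provider_port: int, module_id: bytes) -> str:
    """Encode an action as a path on the local Spoon system's embedded webserver."""
    path = "/".join([instruction, quote(provider_host, safe=""),
                     str(provider_port), module_id.hex()])
    return f"http://localhost:{LOCAL_PORT}/{path}"

def parse_uri(uri: str):
    """Decode the action back into (instruction, host, port, module ID)."""
    parts = urlsplit(uri)
    instruction, host, port, module_hex = parts.path.lstrip("/").split("/")
    return instruction, unquote(host), int(port), bytes.fromhex(module_hex)

uri = make_uri("install", "modules.example.org", 9000, b"\x12\x34\xab")
print(uri)  # http://localhost:8080/install/modules.example.org/9000/1234ab
instruction, host, port, module_id = parse_uri(uri)
```

The point of the design is that the URI addresses the reader's own machine; the provider's hostname and port travel inside the path as data.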
This mechanism for code distribution avoids storing code in static files. It's a departure from Smalltalk's traditional "fileout" mechanism.
The encoded URIs can serve other functions as well, such as listing a system's installed modules, removing an installed module, making a snapshot, and quitting the system. In this way one can use a web browser to interact with a Spoon system for several basic tasks; this is especially useful when the system is headless (e.g., in its initial minimal state).
comments and tags
Editions for authors, classes, methods, checkpoints, edits, and modules each have their own comment and tag editions. This means each one of those artifacts has a comment and tags, and the changes in both are recorded over time. Comments are as we've already been using them: they're explanatory prose about the artifacts. Tags may be familiar to you from the web; they are short semantic markers used for grouping similar artifacts.
I intend for tags to replace class and method categories. Nominally, we've been using class and method categories to establish semantic hierarchies, but the hierarchies have turned out to be quite shallow. Although we can form hierarchies with tags as well, I think we would do better to apply the sorts of algorithms that search engines use, and not concern ourselves with memorizing an artifact's semantic markers. The computational cost this incurs for the tools might have been high in the early days of Smalltalk, but it is quite modest now.
Thanks for reading! Please let me know of any questions or other feedback, and feel free to discuss this on the Spoon and Squeak-dev mailing lists.
-C
[1] http://netjam.org/spoon/naiad
[2] http://netjam.org/spoon
[3] http://en.wikipedia.org/wiki/Universally_Unique_Identifier
--
Craig Latta
improvisational musical informaticist
www.netjam.org
Smalltalkers do: [:it | All with: Class, (And love: it)]
On Wed, Nov 19, 2008 at 12:27:51PM -0800, Craig Latta wrote:
> This is another call for feedback on the design of Naiad[1], a Smalltalk module system I'm writing for Squeak as part of the Spoon project[2].
I have a suspicion that this rewrite has sped up the source code access speed dramatically, which should make the browser much snappier and Monticello snapshotting much faster. Could you comment on a rough estimate of this speedup? I think it is a feature worth advertising.
Hi Matthew--
Good point... I haven't done any measurements yet, though. This is effectively comparing localhost network access from RAM to filesystem access. I imagine it would come down to which way invoked more paging on the part of the host system's memory management. My guess is that the network case would be faster, but not amazingly so these days.
You experience unduly slow speeds with the file-based case now?
thanks again,
-C
On Wed, Nov 19, 2008 at 07:08:12PM -0800, Craig Latta wrote:
> You experience unduly slow speeds with the file-based case now?
No. It's not the filesystem access that is slow. It is that the current source code storage format reads the file character by character, looking for the terminating !, and doing UTF-8 conversion the whole way. 95% of the time in a MC snapshot is spent in testing the source file characters for the terminating ! character. I imagine you do much more in bulk rather than character by character. The format of the .sources file is horrible for access speed.
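To illustrate the kind of difference being described, here is a simplified sketch in Python that contrasts a character-by-character scan for the terminating `!` with a bulk scan. It assumes a chunk ends at a single `!` and that a literal `!` inside a chunk is written doubled as `!!`; it ignores the UTF-8 conversion mentioned above, and neither function is code from Squeak, Monticello, or Naiad.

```python
# Simplified model of "!"-terminated chunks; function names are illustrative.
def read_chunk_slow(text: str, start: int = 0):
    """Scan one character at a time until an undoubled '!'."""
    out = []
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "!":
            if i + 1 < len(text) and text[i + 1] == "!":
                out.append("!")      # doubled bang is a literal '!'
                i += 2
                continue
            return "".join(out), i + 1
        out.append(ch)
        i += 1
    return "".join(out), i

def read_chunk_bulk(text: str, start: int = 0):
    """Jump between '!' characters with str.find and copy the spans in bulk."""
    out = []
    i = start
    while True:
        j = text.find("!", i)
        if j == -1:
            out.append(text[i:])
            return "".join(out), len(text)
        out.append(text[i:j])
        if j + 1 < len(text) and text[j + 1] == "!":
            out.append("!")
            i = j + 2
        else:
            return "".join(out), j + 1

sample = "foo ^'hi!!'! bar ^2!"
print(read_chunk_slow(sample) == read_chunk_bulk(sample))  # True
```

Both functions return the same chunk and the position just past its terminator; the second simply does far fewer per-character operations in the interpreter.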
> It is that the current source code storage format reads the file character by character, looking for the terminating !, and doing utf-8 conversion the whole way.
Oh, right! I forgot about that. :) Yeah, should be quite a bit faster, I'm just answering String objects that are sitting in memory. The measurements will be interesting.
But again, I take it from your comment that you're actually finding the speed of the traditional setup to be a problem? If so, then yeah, it'd make a good marketing point.
thanks again,
-C
On Wed, Nov 19, 2008 at 09:48:38PM -0800, Craig Latta wrote:
> But again, I take it from your comment that you're actually finding the speed of the traditional setup to be a problem? If so, then yeah, it'd make a good marketing point.
It's by far the biggest bottleneck in MC speed. Try viewing changes on the Morphic package. It takes about 3 minutes on a fast machine, and 99.6% of the time is source code lookup, last time I measured Morphic.
On Thu, Nov 20, 2008 at 08:57:40AM -0700, Matthew Fulmer wrote:
> It's by far the biggest bottleneck in MC speed.
I mention it because I've spent a lot of time optimizing MC for speed. Between MC 1.0 and 1.6, I've improved PackageInfo speed 8x and package loading speed 6x (and I have another unstable experimental patch that reduces loading from a quadratic-time operation to a linear-time one). However, I haven't been able to touch package saving speed at all, because it would require changing the source code storage format, which is outside the scope of MC.
Hi,
I didn't follow the whole discussion but read Craig's summary. There are two things I want to know:
Theoretically I like a lot of the ideas. Practically I don't want to deal with two things (subject and history). It reads a lot like I need to start two different images to be able to start working. Are there ideas how to hide the fact that there are two things involved? Or is there even the possibility to have the history memory and the subject memory in one thing? Don't get me wrong, I really like the idea to have these separated. But I want to separate them later, not at first.
I like to edit my history (and also remove versions). Is there a definition of a fallback behaviour of the history memory if a version is missing?
thanks,
Norbert
I would think that Hydra would be a good way to simplify the management of two object memories for the typical use-case. -david
On Sat, Nov 22, 2008 at 3:02 AM, Norbert Hartl norbert@hartl.name wrote:
> Are there ideas how to hide the fact that there are two things involved? Or is there even the possibility to have the history memory and the subject memory in one thing?
Spoon mailing list Spoon@lists.squeakfoundation.org http://lists.squeakfoundation.org/cgi-bin/mailman/listinfo/spoon
Hi Norbert--
> I don't want to deal with two things (subject and history). It reads a lot like I need to start two different images to be able to start working. Are there ideas how to hide the fact that there are two things involved?
Yes indeed. The subject memory starts the history memory (via the OSProcess plugin), and manages its run state afterward, so the developer need never think about it.
> Or is there even the possibility to have the history memory and the subject memory in one thing?
You could run both memories at once with Hydra (as David Pennell suggests), but I'm not yet sure what effect it could have on crash recovery. If one memory managed to crash Hydra, could another one go down at an inopportune state?
> Don't get me wrong I really like the idea to have these separated. But I want to separate them later not at first. I like to edit my history (and also remove versions).
You can do that from the tools running in the subject memory, but no mention need be made of the history memory per se.
> Is there a definition of a fallback behaviour of the history memory if a version is missing?
How would that happen? The only ways I can imagine so far would also take out the entire history memory, and probably the subject memory too (i.e., people should still run backups of their storage).
thanks!
-C