multiple versions of same package vs. mini-images (Was: Re: Guaging & Squeak/JVM)

Mon Feb 11 06:14:33 UTC 2008

Igor-

You suggested "enable multiple versions of same package in same image and
keep track of package dependency". That's been an inspirational suggestion
for me, and I've been thinking about how to implement it for a Squeak/JVM.

I don't have a definite solution yet, but here are some thoughts on it.

I feel it may come down to picking one of two paths.

We could make a complex system for supporting multiple global system
dictionaries (or the equivalent) to allow multiple applications with
different dependencies to live together in one memory image. That's really
just an extension of the status-quo in some ways, packing ever more stuff
into one bigger and bigger image.

Or, we can break the monolithic image into small images which each just
support one application well (call them "mini-images"). Each mini-image
might in turn depend upon some other common mini-images for defining common
classes. This alternative would probably require Spoon-like
  http://netjam.org/spoon/
remote development and remote-debugging support to work best (but it doesn't
absolutely have to, as there easily could be a development tools mini-image
included by reference even in the tiniest mini-image).

Personally, I think the second approach is ultimately simpler and more
elegant, and does a better job of bringing Smalltalk forward in a now
network-oriented world. See:
  "Principles of Design -- Tim Berners-Lee "
  http://www.w3.org/DesignIssues/Principles.html
"Principles such as simplicity and modularity are the stuff of software
engineering; decentralization and tolerance are the life and breath of
Internet."

You may well know all these issues, but I just thought I'd put it down for
others' comments as I understand it (in case I was wrong or missed
something). Probably I'll have outlined some approaches people here know
about already created for Squeak or other systems, and anyone should feel
free to point me to them.

Anyway, feel free to stop reading here, but what follows is more details on
how I came to think about this and arrive at those two possible paths.

=============== how it is now, and a simple approach

The biggest aspect of this is resolving globals. For review, if I recall
correctly this is traditionally done in Squeak by the VM knowing about a
SystemDictionary called "Smalltalk" (the VM needs to know about it
absolutely to resolve a circular dependency of not being able to look up the
global "Smalltalk". :-).  When a CompiledMethod being executed does
something like make a new instance of a class, it fetches the current
instance (typically of a class) associated with the name of the global and
sends it a message or stores it in a variable. Using named globals allows
late binding of classes by the compiled method.

If you didn't care about late binding, like in Forth referring to a
previously defined word, you could just make a hard link as a pointer to the
class at compile time in the compiled method. But then you could not replace
or remove the class in its entirety later.

There is room for only one version of a class at a time this normal way --
just one key in the Smalltalk system dictionary with one value.

The simplest way around this might be to have system dictionary values for
keys be dictionaries. Then you could tag each item with a version. But the
executing code would still need to resolve which one it wanted. And I don't
see how that would be easy. But maybe it might be?
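
To make the versioned-dictionary idea concrete, here is a minimal sketch in
Python (chosen only because it is easy to run anywhere). The names
VersionedGlobals, define, and lookup are made up for illustration and are
not anything in Squeak. It suggests that the storage side is easy; the hard
part is that every lookup now needs a version from somewhere.

```python
# Hypothetical sketch: a "system dictionary" whose values are themselves
# dictionaries keyed by version. None of these names exist in Squeak.

class VersionedGlobals:
    def __init__(self):
        self._table = {}  # global name -> {version -> value}

    def define(self, name, version, value):
        self._table.setdefault(name, {})[version] = value

    def lookup(self, name, version):
        # Late binding: resolved at call time, so a definition can be
        # replaced or removed after the calling code was compiled.
        return self._table[name][version]


class ClockV1:
    def describe(self):
        return "clock 1.0"


class ClockV2:
    def describe(self):
        return "clock 2.0"


smalltalk = VersionedGlobals()
smalltalk.define("Clock", "1.0", ClockV1)
smalltalk.define("Clock", "2.0", ClockV2)

# The caller has to say which version it wants; this is the unsolved part.
old_clock = smalltalk.lookup("Clock", "1.0")()
new_clock = smalltalk.lookup("Clock", "2.0")()
print(old_clock.describe(), "/", new_clock.describe())
```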

And then there is a deeper problem related to composites of objects which
might include instances pointing to two or more different versions of the
same class. But we can ignore that for now. :-)

== A deeper analysis (or, "owww, my brain hurts". :-)

Python has a straightforward way to resolve this -- it supports a sea of
objects, and when you load code, the old classes get overridden in the
equivalent of a system dictionary with new classes, but the existing
instances still point to the old classes so those still hang around but are
not accessible by name. This makes it difficult to do development in a live
system, and you end up issuing special code to load things in differently
(not making new classes) if you want to do Smalltalk-style dynamic
development. But there is no reason you cannot simply load two versions of
the same module (source file) and hang on to them somehow. Squeak could
certainly do something similar if it had modules or classes which could
exist without names.
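
As a small illustration of the Python behavior described above, the sketch
below builds two module objects from source text and keeps both alive. The
helper load_module and the two source strings are made up for this example;
only types.ModuleType and exec are standard Python.

```python
import types


def load_module(name, source):
    """Build a fresh module object from source text without registering it
    in sys.modules, so several versions can coexist."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module


SOURCE_V1 = "class Clock:\n    def tick(self):\n        return 'tick v1'\n"
SOURCE_V2 = "class Clock:\n    def tick(self):\n        return 'tick v2'\n"

clock_v1 = load_module("clock", SOURCE_V1)
old_instance = clock_v1.Clock()

clock_v2 = load_module("clock", SOURCE_V2)
new_instance = clock_v2.Clock()

# The old instance still points at the old class, which stays alive even
# though a newer class with the same name now exists.
print(old_instance.tick())
print(new_instance.tick())
print(type(old_instance) is clock_v2.Clock)
```

Both classes are called Clock, but they are distinct objects, and nothing
but the variables clock_v1 and clock_v2 keeps track of which is which.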

When I try to generalize this global idea, there are other approaches. In
PataPata (in Python/Jython, trying to retrofit them with Squeak-like
capacities) I gave each object (typically Morphic-like GUI components) a
"world" instance variable. That pointed to what was essentially the
equivalent of a Smalltalk system dictionary to store globals or key
functionality. In practice, each major window was in its own world, although
that wasn't strictly required. Then I could have several worlds in the same
process, where each was somewhat self-consistent.

But objects could still slip from one to another, typically when opening an
inspector/browser tool (itself in its own world) on another world and maybe
copying an object from one place to another. Beyond globals, another reason
for each object to have a pointer to its "world" was that when I serialized
a world I just wanted the objects from that world to be written out and no
others, so I could check that pointer to make sure the serialization wasn't
wandering into writing out objects from other worlds (I didn't pursue the
concept of nested worlds, which might have been possible).
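
A rough sketch of that serialization check follows. It is not PataPata's
actual code; World, Thing, and serialize are hypothetical names, and the
"serialization" only collects object names to keep the example short.

```python
# Hypothetical sketch of a per-object "world" pointer used to keep a
# serializer from wandering into other worlds. Not PataPata's real code.

class World:
    def __init__(self, name):
        self.name = name
        self.globals = {}  # stands in for a system dictionary


class Thing:
    def __init__(self, world, name):
        self.world = world
        self.name = name
        self.children = []


def serialize(root):
    """Return the names of objects reachable from root that live in
    root's own world, skipping anything owned by another world."""
    home = root.world
    written, seen, stack = [], set(), [root]
    while stack:
        obj = stack.pop()
        if id(obj) in seen or obj.world is not home:
            continue  # already written, or belongs to another world
        seen.add(id(obj))
        written.append(obj.name)
        stack.extend(reversed(obj.children))
    return written


garden = World("garden")
tools = World("tools")
window = Thing(garden, "window")
button = Thing(garden, "button")
inspector = Thing(tools, "inspector")  # a tool that slipped into the window
window.children = [button, inspector]

print(serialize(window))
```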

I was planning to use unnamed references to parents from prototypes (for
inherited behavior and constants) in PataPata, based on how Self did
prototypes and links, but I decided in the end to reference prototypes
representing parents by name, for the purpose of documenting intent. But
that left a global lookup problem, resolved by having *every* prototype have
a "World" pointer. And there were predictable problems when worlds pointed
to themselves which I had to work around (especially when loading worlds).
[Self has a fancier way of getting names for unnamed prototypes I did not
want to try pursuing based on determining paths from a root.]

Anyway, generalizing on this "object-focused late binding lookup" approach,
objects can point to a global system dictionary, or they can point to other
objects in some consistently structured way (typically "parent" or
"container" or "class") which might in turn allow a path to find a global
(that process might even percolate up and then back down, say to *search*
for an object with a certain value; I supported this in PataPata to find
widgets with a certain name in the same window as a widget executing some
behavior).
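
The "percolate up and then back down" search might look something like the
sketch below. Widget, container, contents, and find are illustrative names
only, not the ones PataPata used.

```python
# Hypothetical sketch of an object-focused lookup: walk up the container
# links to the enclosing window, then search back down for a name.

class Widget:
    def __init__(self, name, container=None):
        self.name = name
        self.container = container
        self.contents = []
        if container is not None:
            container.contents.append(self)

    def root(self):
        node = self
        while node.container is not None:
            node = node.container
        return node

    def find(self, name):
        """Percolate up to the top container, then search downward."""
        stack = [self.root()]
        while stack:
            node = stack.pop()
            if node.name == name:
                return node
            stack.extend(node.contents)
        return None


window = Widget("window")
panel = Widget("panel", window)
ok_button = Widget("ok", panel)
status = Widget("status", window)

# The button finds a sibling widget without any global dictionary.
print(ok_button.find("status").name)
```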

But there is another way to do this, which is to have the thread, process,
stack frame, or virtual machine hold onto a global system dictionary object
somehow. This is closer to how Squeak does it with a system dictionary,
except there might be one system dictionary per process or thread or stack
frame. The difference is that the entity executing the code knows where to
look for globals even if the objects being used for executing code do not
(which presumably saves on memory, and provides a more consistent notion of
what versions of classes a process wants to see, assuming that is a good idea
:-). In a most extreme case, the user running the program might know the
object ID or memory location of the global system dictionary and pass it in
as needed (this might happen in a debugger session). I might call this an
"execution-focused late binding lookup" approach.

For completeness, there is another approach which is to have globals stored
in relation to the memory where the objects are stored (or processes
executed) if memory is partitioned somehow. So if you have an object or
process memory location, you can find the global system dictionary that goes
with it by looking somewhere special in that memory chunk (beginning, end,
standard offset). Deep in the reality of a virtual machine, it might even be
using this approach in various ways (like making sure the pointer to the
system dictionary is, say, the first handle in an object memory table).

Probably someone who has a PhD in computer science could tell me the proper
terms for these approaches towards late binding? :-)

And of course, you can use more than one at a time. NewtonScript, for
example, found variables by having two different types of lookup, based on a
parent slot and visual containment. Maybe you could use all of the
approaches at once in some system just for fun. I don't think I'd want to
debug anything in it though. :-)

Anyway, this doesn't answer how specifically to do what you propose, but it
does suggest some possible points of intervention -- mainly instances or
processes.

But this leads to a deeper point. A Smalltalk VM (or any OO VM system like
it, like the JVM objects or Python objects) has a problem with multiple global
objects if objects sharing the same VM in different global spaces can point
at each other directly.

Essentially, if you can have multiple global system dictionaries, you end up
in a situation where an object from a "module" in one set of interconnected
versions of modules can be referenced by an object in a "module" in another
interconnected set of different module versions. At that point, what governs
the object's behavior, especially late binding lookup of globals? Should it be
governed by the module the object came from? Or should it be governed by the
module which it is now connected to? Or should it be governed by the process
executing and calling a method of the object (and that process might look up
its globals in yet another way)?  And similarly, when you absorb an instance
from another module, should its class still point to the old class or should
it point to the class in the new module?

In general, this issue is a variant of a deeper problem related to OO:
  http://mail.python.org/pipermail/edu-sig/2007-April/007852.html
as I feel the idea, that objects can stand alone and be somehow meaningful,
is at the root of a lot of evil in the Smalltalk universe (e.g. "bitrot". :-)

Anyway, just from random comments here over the years, I get the feeling
that in their hearts the original Squeak Central people (Dan Ingalls
especially) understand this and use heavily customized images in practice as
coherent wholes, but perhaps they have never had the time to generalize this
idea to a philosophical principle. Certainly just fighting for objects at
all, as well as messages and VMs and good tools must have taken up lots of
energy.

Part of this issue may depend on whether you think of an object like a
single-celled creature like an Amoeba, or whether instead you think of an
object as part of a biological entirety, like as a protein molecule in a
cell, or a highly regulated cell in a large multicellular entity. If objects
can't meaningfully stand alone, then it seems like we need some coherent
philosophical approach to how they fit together into modules or images.

Loading multiple versions of the same classes seems to strain this possible
coherence, as useful as it might be. It's not that it won't work, it's just
that the mental complexity starts increasing to the point where you may have
to be really clever (and really alert) to keep track of it all. :-)

=== two competing approaches

Because of all these difficulties and complexity, I'm inclined to lean
towards suggesting that images should be smaller, :-) and VMs could
either be lightweight or perhaps could support multiple open images at once.
Then you can load one version of a module into a larger set of other
modules, and maintain that set for one application. This total image defines
an ecology of objects, and the objects and their classes all make sense in
relation to each other (as well as whatever I/O they choose to do through
the VM to the rest of the world). This is sort of like a living cell. And
you could then load a different version of code modules into another
*different* image and maintain that set for a different application. And
when these applications want to communicate, it will be from one image to
another, through their different VMs, presumably via sockets or shared
memory or files or whatever, via some common serialization process. There
are already several approaches for distributed objects in Smalltalk, so I
doubt this will be much of a problem, and the JVM and Java offer other
possibilities for remote procedure calls and such. I think that a minimal
image ("mini-image") approach might come closest to bringing some sanity to
the idea of personal images (like Dan Ingalls seems to like). Every image
would be a custom mix of module versions and hacked up base class code. The
image would know with a little developer help which objects belonged to
which modules. To help with this, one would need easy tools to export module
versions and configurations. An important aspect of such an approach might
be Spoon-like remote debugging, and remote development of minimal images so
you could have, say, one image open with your favorite debugging tools and
over a socket just plug those tools into other images you wanted to modify
or debug; this isn't strictly necessary -- but conceptually it makes things
more elegant, especially since then the development tools can have
different versions of base classes than the system being debugged or
developed. I get the feeling the Squeak ecosystem has most of the parts of
all of this, they just haven't been all put together and polished toward
this end.

Still, for the JVM, which is what interests me right now, all the objects do
live in one world, and the JVM has a big memory footprint. So, given memory
footprint and startup time, even with the newer JVMs sharing some memory
across VM instances, I think we might have to end up living with multiple
system dictionaries in one JVM unless JVMs improve further? Or maybe if we
discover they are good enough now? In that case, I end up wondering if a
"world" instance variable added to every underlying Java object is such a
bad idea after all. :-) Or the alternative of a "world" instance variable
stored in each thread (or process) is also possible. Of course, globals are
rarely looked up, so more indirect ways of storing them might be more
efficient trading off time for memory. So this is a second alternative
approach which is closer to the direction you outline.
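
The per-thread (or per-process) "world" alternative mentioned above could
be sketched as follows, with the executing context rather than each object
holding the system dictionary. This uses Python's standard contextvars
module; current_globals, lookup, and run_with_globals are hypothetical
names, and a Squeak/JVM version would presumably hang the dictionary off a
Java thread instead.

```python
import contextvars

# Hypothetical sketch of "execution-focused" lookup: the running context,
# not each object, knows which system dictionary to use.
current_globals = contextvars.ContextVar("current_globals")


def lookup(name):
    return current_globals.get()[name]


def run_with_globals(system_dictionary, function):
    """Run function in a copied context whose globals are the given
    system dictionary; the caller's own context is left untouched."""
    def body():
        current_globals.set(system_dictionary)
        return function()
    return contextvars.copy_context().run(body)


class OldClock:
    def describe(self):
        return "clock 1.0"


class NewClock:
    def describe(self):
        return "clock 2.0"


def make_clock():
    # The same code resolves "Clock" differently depending on who runs it.
    return lookup("Clock")()


app_a = {"Clock": OldClock}
app_b = {"Clock": NewClock}

print(run_with_globals(app_a, make_clock).describe())
print(run_with_globals(app_b, make_clock).describe())
```

The objects themselves carry no extra field here, which is the memory
saving; the cost is that an object handed from one application to the other
silently starts resolving globals in the other application's dictionary.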

== best solution long term?

After considering two paths in the previous two paragraphs, I think using
lightweight images with only one system dictionary is a better way to go
long term. They are just simpler and already well understood.

If you, say, want a little clock up on your screen implemented in Squeak
(instead of Lively Kernel :-), you just have a clock image. Ideally, that's
all it does -- it's a clock. If you want to inspect the clock, you fire up
your development image in another JVM and connect to that clock JVM (maybe
using a universal debugging registry service). Maybe your development image
even gives you a copy of the image of the clock window with drag-and-drop
overlays on another screen. Or it might put annotations over the original
window by temporarily inserting a "glasspane" if the clock application was
using Swing widgets, or by the usual Squeak ways if the Clock application
used Morphic widgets.

To save space and maybe help with upgrades, perhaps the Clock application
image depends on another larger base image. I did that in PataPata where
worlds could require other worlds to be loaded first. Since I stored images
as textual Python code which could rebuild a world of objects procedurally,
that worked out OK. Here is an example of a simple PataPata world; I would
expect a Squeak clock image built in a similar fashion would be about the
same tiny size and also written out as textual source:
http://patapata.svn.sourceforge.net/viewvc/patapata/tags/PataPata_v204/WorldDandelionGarden.py?revision=315&view=markup
(One fudge, the bitmap was stored outside the image in a file.)
Note the line:
  world.worldLibraries = [world.newWorldFromFile("WorldCommon.py")]
which is what defines the other worlds this world depends on. So, for
Squeak, this would be like saying your small image depends on other images
which load first.
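
To illustrate "images which load first", here is a small sketch of loading
textual worlds with their dependencies. It is not PataPata's loader; the
requires/build convention, load_world, and the in-memory WORLD_SOURCES
table (standing in for files on disk) are all assumptions made for this
example.

```python
# Hypothetical sketch: each "mini-image" is Python source text that names
# the worlds it requires and provides a build(world) function.

WORLD_SOURCES = {
    "WorldCommon": (
        "requires = []\n"
        "def build(world):\n"
        "    world['Morph'] = 'base morph prototype'\n"
    ),
    "WorldClock": (
        "requires = ['WorldCommon']\n"
        "def build(world):\n"
        "    world['Clock'] = 'clock built on ' + world['Morph']\n"
    ),
}


def load_world(name, sources, loading=()):
    """Load the required worlds first, then rebuild this world on top."""
    if name in loading:
        raise ValueError("circular world dependency: " + name)
    namespace = {}
    exec(sources[name], namespace)
    world = {}
    for dependency in namespace["requires"]:
        world.update(load_world(dependency, sources, loading + (name,)))
    namespace["build"](world)
    return world


clock_world = load_world("WorldClock", WORLD_SOURCES)
print(sorted(clock_world))
```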

Obviously you have to have any supporting images around or you can't load
your dependent one, but for the most part you just typically depend on
common downloaded images. If images are stored as text (essentially, a
Smalltalk program needed to rebuild the image) dependencies are a lot less
scary since you could always just go in and start cutting and pasting in a
text editor (but hopefully there would be better tools for this).

How to track and merge changes to base classes in supporting images is
obviously an issue, and it is not one PataPata tried to solve (beyond the
fact that prototypes made it easy to override base class behavior for most
things). But, since at runtime the supporting packages will be loaded, you
can easily modify them in the live image and then write out a modified version
of the base image again with a different version number, and hope somebody
down the road can reconcile your changes if you want them to move forward
with the supporting image.

In this lightweight approach, images might also become modules stored in
some source code repository if desired, or really, they might become more
like (ENVY-ish?) configuration maps on top of available stored modules. So,
to try to provide an example, you might save your running Clock image as
module Clock-1.1.4 which also depends on BaseClasses-3.4.2. (This would
require a worldwide way to identify Squeak modules uniquely.) Of course you
might not store Clock-1.1.4 on a server; it might be stored on a local drive
(perhaps in a Jar file, leading to Java classpath problems, but nothing is
perfect :-). You might open up Clock-1.1.4, modify it using Spoon-like
remote tools, and maybe even save it back under the same version number if
no one else depended on it (perhaps with an automatic minor sequential
internal revision number bump just in case). These names and version numbers
might also be more like human readable suggestions than absolutes -- for
example each "image" "module" could have a unique UUID (plus perhaps save
sequence) and dependencies could be expressed as lists of acceptable UUIDs
as well as names, with some sort of sophisticated matching algorithm to
try to resolve dependency issues and search for modules in various places.
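
One very simple (far from sophisticated) matching rule could look like the
sketch below: prefer a module whose UUID is on the acceptable list, and
otherwise fall back to the highest numbered version with the right name.
ModuleId, Requirement, and resolve are hypothetical names, and the version
comparison assumes purely numeric dotted versions like 3.4.2.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModuleId:
    name: str
    version: str
    uid: str


@dataclass
class Requirement:
    name: str
    acceptable_uids: frozenset = field(default_factory=frozenset)


def version_key(module):
    # Assumes numeric dotted versions, e.g. "3.4.2" -> (3, 4, 2).
    return tuple(int(part) for part in module.version.split("."))


def resolve(requirement, available):
    """Prefer an exact UUID match; otherwise take the highest version of
    a module with the requested name; return None if nothing matches."""
    exact = [m for m in available if m.uid in requirement.acceptable_uids]
    if exact:
        return exact[0]
    by_name = [m for m in available if m.name == requirement.name]
    if not by_name:
        return None
    return max(by_name, key=version_key)


base_342 = ModuleId("BaseClasses", "3.4.2", str(uuid.uuid4()))
base_350 = ModuleId("BaseClasses", "3.5.0", str(uuid.uuid4()))
available = [base_342, base_350]

pinned = Requirement("BaseClasses", frozenset({base_342.uid}))
loose = Requirement("BaseClasses")

print(resolve(pinned, available).version)
print(resolve(loose, available).version)
```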

For this Clock example, when you work on the clock you might pull up another
image of development tools (browser, debugger, inspector, and so on). But
the versions of these (or the base classes they depend on) don't really
matter to the clock application. All that matters is that somehow the two
JVMs (or JVM processes) agree on how to talk to each other to add new
methods, return results, single step code, follow object references, and so
on. Presumably one could have a fairly standard protocol for that -- maybe
even an extensible one (perhaps Spoon has this?). Let's say something odd is
happening with the Clock. You want to see how an older version works. Well,
you just open up that older clock image. Then you might even open up an
"image comparing" utility image :-) which lets you connect to both the
running Clock images simultaneously and compare versions of all the classes
looking for differences. Still unsatisfied, maybe you clone the older image
(to start a third clock running) and bit by bit copy classes or modules from
the new image to the copy of the old until you find where the clock starts
to behave oddly. Then you make a change (remotely) to the first clock image
and see if it fixes the problem. Perhaps it turns out your code is perfect
but the anomaly is due to a really deep problem in code supporting
Squeak/JVM -- so you drop down a level conceptually and pull up a JVM
debugger image, or maybe even just Eclipse, :-)
  http://www.eclipsezone.com/eclipse/forums/t53459.html
connect to the JVM supporting that Clock image directly, and start swearing
as you try to figure out what the Squeak/JVM maintainers did wrong this
time. :-) If you wish, all of your actions with the multiple Squeak-ish VMs
could have been logged to some common history repository somewhere to replay
the entire multi-VM development session back to everyone who doesn't believe
you that it's a JVM level issue. :-) Presumably one could build testing
tools for this architecture as well.

And Squeak in C could go down this mini-image route too.

As I think a little more about this, I am still perhaps stuck with the
problem that even in these mini-images, there would need to be some way to
link specific objects back to specific modules so a modified module could be
written back out with all its related objects. This is because a mini-image
is not just code, it is code plus live objects. And so when objects are
created, they would have to be assigned somehow to a specific module or
source mini-image. So, perhaps this mini-image solution needs to have a
"world" field (or "module" or "segment") in every object anyway, just so the
modified objects can be written back out into the right mini-image or
module?  Or, if this was implemented in C, the image would be carved up into
memory segments, with new objects allocated to the chunk of memory going
with the specific mini-image that was loaded.

Squeak already has an image segment effort:
 http://wiki.squeak.org/squeak/1213
"ImageSegments and project swapping are still in the experimental stage"
But it is binary, not textual source. And it is based on specific roots, not
some sort of tag for each object. I guess both might take about the same
amount of space -- instead of tagging each item with its segment (world),
you have a big array which points to each object in the segment. Maybe you
might want both? So objects know their segment and segments know their
objects? And I find it a little amusing I am putting up windows in PataPata
defined by textual mini-image files of 3688 bytes (assuming a bitmap loads
off the network or from a local file :-) while they are talking about binary
image segments of 10s of megabytes.
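
The "maybe you want both" idea, where objects know their segment and
segments know their objects, might be sketched like this. It is not how
Squeak's ImageSegment works; Segment, new, adopt, and Hand are hypothetical
names used only to show the two-way bookkeeping.

```python
# Hypothetical sketch: every object is tagged with its segment, and every
# segment keeps a list of its objects, so either side can find the other.

class Segment:
    def __init__(self, name):
        self.name = name
        self.members = []  # the segment knows its objects

    def new(self, cls, *args):
        obj = cls(*args)
        obj.segment = self  # the object knows its segment
        self.members.append(obj)
        return obj

    def adopt(self, obj):
        """Move an object here, keeping both directions consistent."""
        obj.segment.members.remove(obj)
        obj.segment = self
        self.members.append(obj)


class Hand:
    def __init__(self, label):
        self.label = label


clock = Segment("Clock-1.1.4")
base = Segment("BaseClasses-3.4.2")

hour = clock.new(Hand, "hour")
minute = clock.new(Hand, "minute")
shared = base.new(Hand, "template")

clock.adopt(shared)  # the object is now written out with the clock
print([obj.label for obj in clock.members])
```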

And as I read more on modular Squeak, I'm realizing that with mini-images
the idea of a "project" would probably go away entirely.

And any tool which compared mini-images would have to have some way of
representing objects in two different mini-images so it could look for
similarities and differences. At the very least, maybe like Les Tyrrell's
OASIS project:
  http://wiki.squeak.org/squeak/1056
But there is a big difference between loading representations of objects
(instances or their classes) to look at them and loading objects to use them.
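
A comparison tool working on representations rather than live objects
could be as simple as the following sketch. It assumes, purely for
illustration, that each image has been exported as a dictionary mapping
class names to selectors to method source text; compare_images and the
sample data are made up.

```python
# Hypothetical sketch: compare two images through plain-data
# representations of their classes, never loading the classes themselves.

def compare_images(old, new):
    """Each image is {class name: {selector: source text}}. Return, per
    class, the sorted selectors that were added, removed, or changed."""
    report = {}
    for class_name in sorted(set(old) | set(new)):
        old_methods = old.get(class_name, {})
        new_methods = new.get(class_name, {})
        changed = sorted(
            selector
            for selector in set(old_methods) | set(new_methods)
            if old_methods.get(selector) != new_methods.get(selector)
        )
        if changed:
            report[class_name] = changed
    return report


clock_113 = {
    "Clock": {"tick": "seconds := seconds + 1", "draw": "self drawHands"},
}
clock_114 = {
    "Clock": {"tick": "seconds := seconds + 2", "draw": "self drawHands"},
    "Alarm": {"ring": "Beeper beep"},
}

print(compare_images(clock_113, clock_114))
```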

Anyway, no easy solution. But I still think this second mini-image approach
is simpler conceptually than attempting to keep different versions of the
same things in the same VM. Both are possible, of course.

===

Anyway, maybe someone reading this might have a better suggestion or a
better (simpler, clearer) way of looking at this issue.

--Paul Fernhout

Igor Stasenko wrote:
> Ken Causey wrote:
>> [snip]
>> Within this community I've come to feel that the only day to day
>> practical solution is to do it and then ask for forgiveness when it goes
>> all pear shaped (badly).  Of course when that happens it really helps
>> when it is something that can be readily reversed with no harm done.
>> And that's where it seems we have a problem because the current release
>> management schemes don't well-support removing something readily and in
>> such a way that few if any are inconvenienced.  I don't have a ready
>> solution to that, it is something I find myself thinking about more and
>> more.
>> [snip]
>
> There is a solution: enable multiple versions of same package in same
> image and keep track of package dependency.
> So, when you loading an updated package, all code which worked before,
> continues to work in same way as it was before.
> We need a way to be able developer to choose, what parts of system can
> use new version and what should use older version due to
> incompatibility reasons by simply checking dependencies and updating
> dependency links.
> 
> Also, this would help a lot in maintaining packages: a package author
> can easily keep track of his package dependencies, and may or may not
> wish to release his package with updated dependencies, which use
> latest versions of packages, his package depends from.
> 
> Of course, this is somewhat idealistic, and there is many caveats, but
> if done well, will allow us to mix things without fear that something
> will not work due to incompatibilities.