Complexity and starting over on the JVM (ideas)

Wed Feb 6 20:55:01 UTC 2008

Bergel, Alexandre wrote:
> Source code of classes in SmallWorld may be easily extracted. Would be
> easier to do it with Athena however, mainly because I fixed few bugs.

I was thinking of extracting that source, but you need to do a little fancy
stuff for pool dictionaries and such. I did something like that in PataPata
(in Python) where I modified objects to write themselves out as Python
source code which would recreate the original object when run. Actually, I
would be more tempted by the GNU Smalltalk library code than SmallWorld
since I expect it would be more complete (and it is already in source form,
and LGPL licensed I believe, which is not a problem for me, though for
embedded stuff it may get sticky about what that implies if all the
libraries are bundled into some sort of Flash-based distribution).

>> If Athena and Spoon and Dan's project and a few other projects and people
>> could get together somehow, then we might have something even greater, and
>> not several people working mostly on their own. But coming up with a shared
>> vision might be hard?
> 
> In my opinion, the key of success for JRuby, Rhino, JScheme, Jython is
> to enable interaction with Java. When Dan will release its source, I
> might port Athena on his VM. But again, interaction with Java is crucial.

As I think about it more, I agree. I think Squeak on Java, if it is like
Talks2 or Dan's, may not really buy that much, since the Squeak VM already
runs on all the Java platforms. It is when you want to make changes which
are equivalent to changes to the VM, or when you want to include compiled
code, that you see the benefit. For me, one of the biggest advantages is
that it makes it possible to write code almost as fast as C code while still
having it be easily cross-platform without #ifdefs and lots of testing. So,
even if it is not interoperable with Java objects, a Squeak on the JVM is of
interest -- but only if you intend to include fast compiled Java code (for
example, to support numerical operations on matrices -- an issue for me as I
have some code with a 3D turtle).

>> For design simplicity and run-time extensibility, that "SmalltalkInstance"
>> class might have a few fields even if they were not usually all used. They
>> might be:
>> * class (usually a reference to another "SmalltalkInstance")
>> * dictionary of instance variables (not an array since I don't want to
>> bother for recompiles in keeping things up to date, and for debugging we
>> can have symbols as selectors not indexes)
> 
> Are you sure that this dictionary is compatible with an efficient lookup ?

On this, I think a dictionary (HashMap, whatever) makes sense for
simplifying the compiler and overall system design. It may not be optimal
speedwise. But I'd rather get that to work right first, including supporting
dynamic on-the-fly programming, and then worry about optimizing it later.
These sorts of dictionaries work well enough for Jython and Python. I expect
there to be so much overhead anyway that the extra loss from not looking
up data by index will hardly be noticeable. My biggest limit now is just
programmer time.
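
A minimal sketch of that dictionary-based layout, in Python for brevity. The
class name "SmalltalkInstance" comes from the quoted proposal; the accessor
names inst_var_at and inst_var_at_put are hypothetical, chosen only for
illustration:

```python
class SmalltalkInstance:
    """Hypothetical sketch: instance variables live in a dict keyed by name."""

    def __init__(self, cls=None):
        self.cls = cls          # usually another SmalltalkInstance
        self.inst_vars = {}     # name -> value, no fixed slot indexes

    def inst_var_at(self, name):
        # Unknown variables read as None (standing in for Smalltalk nil), so
        # adding a variable to a class never requires rewriting old instances.
        return self.inst_vars.get(name)

    def inst_var_at_put(self, name, value):
        self.inst_vars[name] = value
        return value


point = SmalltalkInstance()
point.inst_var_at_put("x", 3)
point.inst_var_at_put("y", 4)
# A variable added "on the fly" later simply reads as nil on old instances.
z = point.inst_var_at("z")
```

The cost is a hash lookup per access instead of an indexed load, which is the
trade-off being accepted here.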

>> * dictionary of locally overridden methods (for prototype support)
> 
> Why?

Well, it isn't really necessary for Smalltalk. But if lookup code checks
this dictionary first, then any instance could have a local method for any
message. It might also be used to break the whole metaclass circularity
problem, since you can just put class methods in this dictionary for classes
(in fact, with this addition, no object then really *needs* a class to have
behavior). But, I'll agree it isn't plain Smalltalk. And I myself also
question the merits of prototype programming these days.
  http://patapata.sourceforge.net/critique.html
"The biggest issue is that I learned that the seductive idea of prototypes
may not be as powerful as its execution (especially as an add-on to an
existing language). Consider what I wrote to someone else: "I've been
repeatedly confronted with the fact that prototype languages are powerful
and elegant but ultimately, like Forth, end up losing something related to
documenting programmer intent. Documenting intent is one of the most
important parts of programming, and that is something that Smalltalk's formal
class structure enforces for most code over Self, just like C's named
variables document intent which gets lost manipulating Forth's data stack.
Plus, it is hard for any great language feature to offset issues like lack
of community or lack of comprehensive libraries." And after that person
suggested you usually need to name things before you can share them, I
replied: "And I agree on the issue of "naming" something in order to share
it. Which gets back to the issue I mentioned of "documenting intent". With
the PataPata system I made for Python, there was a choice point where I
decided that prototypes used by other prototypes as parents needed to be
named (a difference from Self). And I liked that choice. But then, what am I
really gaining over Smalltalk (except not having a class side, which is
nice, but is it worth the extra trouble and confusion?)"
So, this leaves me questioning the whole move to prototypes. That person
also pointed out previous work on "Exemplar based Smalltalk", so that is
something I should look into further, perhaps as a compromise with
Prototypes when I understand such previous work better."

Still, it is possibly a great feature for bootstrapping the system.  :-)
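
The lookup order described above (local method dictionary first, then the
class chain) could be sketched roughly as follows. This is Python for
brevity, and every name in it (Obj, local_methods, lookup, send) is a
hypothetical stand-in, not taken from any existing system:

```python
class Obj:
    """Hypothetical object with a class pointer and a local method dictionary."""

    def __init__(self, cls=None):
        self.cls = cls            # may be None: behavior can be purely local
        self.local_methods = {}   # selector -> function, checked first
        self.methods = {}         # methods this object offers to its instances
        self.superclass = None

    def lookup(self, selector):
        # 1. Locally overridden methods win.
        if selector in self.local_methods:
            return self.local_methods[selector]
        # 2. Otherwise walk the class chain as usual.
        cls = self.cls
        while cls is not None:
            if selector in cls.methods:
                return cls.methods[selector]
            cls = cls.superclass
        return None

    def send(self, selector, *args):
        method = self.lookup(selector)
        if method is None:
            raise AttributeError("doesNotUnderstand: " + selector)
        return method(self, *args)


# A "class" with no metaclass: its class-side method sits in local_methods.
Animal = Obj()
Animal.methods["speak"] = lambda self: "..."
Animal.local_methods["new"] = lambda self: Obj(cls=self)

pet = Animal.send("new")
default_sound = pet.send("speak")
pet.local_methods["speak"] = lambda self: "woof"   # per-instance override
custom_sound = pet.send("speak")
```

Note that Animal itself has no class at all here, yet it still answers "new",
which is the bootstrapping convenience mentioned above.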

>> * array of Java/JVM objects
>> * array of binary data
> 
> What those arrays are for?

Smalltalk typically makes you specify what type of subclass you want: one
with instance variables as slots, one with an array of binary data, or one
with an array of pointers. By putting in slots for each of those, there is no
need to declare this information or worry about transforming such objects.
Every Smalltalk object could then potentially contain binary data or
pointers to Java objects it needed. Again, this is a starting
simplification. Maybe down the road those things might be optimized out.
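
As an illustration only, a uniform layout of that kind might look like this
in Python (the class and field names are hypothetical, and a Python list
stands in for an array of JVM object references):

```python
class UniformInstance:
    """Hypothetical layout: every object carries all three kinds of storage."""

    def __init__(self):
        self.inst_vars = {}          # named instance variables
        self.pointers = []           # indexed references (e.g. to JVM objects)
        self.binary = bytearray()    # raw bytes


# The same class serves as a plain record, an Array, or a ByteArray.
record = UniformInstance()
record.inst_vars["name"] = "turtle"

array_like = UniformInstance()
array_like.pointers.extend([record, 42])

bytes_like = UniformInstance()
bytes_like.binary.extend(b"\x01\x02\x03")
```

Each object pays for two or three empty containers it may never use, which is
the sort of thing that could be optimized out later.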

>> Swapping object identities is handled at the proxy level -- the two proxied
>> objects are just exchanged. So you can see the Proxy level is acting a bit
>> like an object table.
> 
> Is the swapping related to implement "become:"?
> Why would you need this? Since you are relying on the Java memory model,
> only Java objects that have been "proxied" may be involved in a become
> operation right? Any motivation for this?

Well, I'm just trying to be generic here as a first cut. If you look at
Dan's comment in this thread, he specifically mentions not using Java
objects to support become: and enumerating object identities.
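
A rough sketch of swapping identities at the proxy level, in Python for
brevity. Proxy, swap_identities and Named are hypothetical names; the point
is only that every reference in the system holds a proxy, so exchanging the
two targets is seen by all holders at once:

```python
class Proxy:
    """Hypothetical handle: all references point here, never at the target."""

    def __init__(self, target):
        self.target = target

    def __getattr__(self, name):
        # Only called for names not found on the Proxy itself,
        # so everything except 'target' is forwarded.
        return getattr(self.target, name)


def swap_identities(a, b):
    # Two-way become: exchange the proxied objects.
    a.target, b.target = b.target, a.target


class Named:
    def __init__(self, name):
        self.name = name


p = Proxy(Named("first"))
q = Proxy(Named("second"))
holder = [p, q]           # existing references elsewhere in the system
swap_identities(p, q)
```

After the swap, holder[0] answers with the name of what used to be the second
object, without any scan of memory for references.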

>> "Becomes" might be handled by changing the became field to true and then
>> changing the proxy to the new object. After that, the dispatching will work
>> differently and just be passed onto the new object without further
>> processing.
> 
> As far as I can tell, most other languages do not provide become. In
> Smalltalk, become is mainly used to do some update in case of changes in
> class definition. 

I think it is also used to proxy objects and for some other reasons. You have
an object you want to monitor (by overriding doesNotUnderstand: for example,
to print the messages and then send them onto the proxied object), so you
want all references to the object to become references to the proxy, and for
the proxy to then hold onto the object.
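
That monitoring use could be sketched like this, in Python for brevity. All
names are hypothetical: Handle plays the object-table entry, the receive
method plays the role of message dispatch, and LoggingProxy does what an
overridden doesNotUnderstand: would do in Smalltalk:

```python
class Handle:
    """Hypothetical object-table entry; every reference goes through one."""

    def __init__(self, target):
        self.target = target

    def send(self, selector, *args):
        return self.target.receive(selector, *args)


class Account:
    def __init__(self):
        self.balance = 0

    def receive(self, selector, *args):
        return getattr(self, selector)(*args)

    def deposit(self, amount):
        self.balance += amount
        return self.balance


class LoggingProxy:
    """Plays the doesNotUnderstand: role: record each message, then forward."""

    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.log = []

    def receive(self, selector, *args):
        self.log.append((selector, args))
        return self.wrapped.receive(selector, *args)


def become_monitored(handle):
    # One-way become: the handle now leads to the proxy,
    # and the proxy holds onto the original object.
    monitor = LoggingProxy(handle.target)
    handle.target = monitor
    return monitor


account = Handle(Account())
client_reference = account            # some other part of the system
monitor = become_monitored(account)
result = client_reference.send("deposit", 25)
```

The client never learns that its reference now passes through a monitor.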

> Why do you want to keep it if this Smalltalk on JVM
> will be file-based ?

I'm not sure a JVM Smalltalk would necessarily be file based. Files
(essentially a hierarchical database of text objects) are just options as to
where to get objects or where to store them IMHO. Put objects in a
PostgreSQL database. Put them in RDF files. Put them out over sockets on the
network. Why should I care where the objects go or where they come from
(other than for security reasons)? Still, I'd be tempted to define a way
that images of object memory could be written out as human-readable source
code (which reconstructs the objects in memory when run) like I did with
PataPata, as a way they could be serialized, perhaps with other text-based
representations possible as well (XML, JavaScript-derived, etc.)
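
A minimal sketch of the "objects write themselves out as source" idea, in
Python. This is not PataPata's actual code; Thing and to_source are
hypothetical names, and the sketch handles only plain literal values and
nested Things (no cycles, no shared references):

```python
class Thing:
    """Hypothetical object that can emit source code recreating itself."""

    def __init__(self, **fields):
        self.__dict__.update(fields)

    def to_source(self):
        parts = []
        for name, value in sorted(self.__dict__.items()):
            if isinstance(value, Thing):
                rendered = value.to_source()   # nested objects recurse
            else:
                rendered = repr(value)         # plain literals only
            parts.append(f"{name}={rendered}")
        return "Thing(" + ", ".join(parts) + ")"


turtle = Thing(name="turtle", heading=90, pen=Thing(color="red", down=True))
source = turtle.to_source()
rebuilt = eval(source)   # running the text rebuilds an equivalent object graph
```

The same walk over the object graph could just as well emit XML or some other
text form; only the rendering step changes.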

>> Calling into Smalltalk from Java I haven't thought through. I know Jython
>> does it but I have not looked at that. I suspect you could intercept an
>> exception and do something. I'm assuming exceptions could go up the message
>> stack, but I admit that throwing them across intervening Java function
>> calls within the method stack could be problematical -- the message stack
>> might need to be annotated somehow in those cases before calling the
>> function, another performance hit.
> 
> You're right. An exception thrown in a different world may be problematic.

Some issues may not be that easy, or even worth the trouble compared to the
benefit. If I had a Smalltalk with Squeak-ish applications over
GNU Smalltalk-derived core classes, which could interoperate with Java/JVM
objects as well as Jython (like calling OpenGL or Concord's Molecular
Workbench libraries
  http://mw.concord.org/modeler/index.html
or even just Java or Scala code I might write myself for speed), and if some
exceptions can't be restarted as well as I would like (though still could be
debugged using a JVM-oriented debugger written in Smalltalk), well, that
might still be a very nice result. :-) And a result sure to improve as new
versions of the JVM come out with better dynamic language support. :-)

>> Now, this approach has a higher memory cost. And it also has a loss of
>> runtime performance for Smalltalk code. But I think if you merged in support
>> for Scala (a JVM language which is typed and fast) to compile bottleneck
>> code from Smalltalk->Scala->JVM, the hybrid mix might be pretty fast.
>> Equivalent to the successful CPython/C mix. Or that would be my hope at
>> least.
> 
> I feel that to tackle the performance issue, having a bytecode
> translator Smalltalk -> Java would be an acceptable approach.

I think I'd rather write Smalltalk->Scala and let the Scala team worry about
optimization. That's the kind of stuff they like to do. :-) But I also would
not worry about, say, requiring the Java compiler to be present at the
start, if Smalltalk->Java made more sense. Or using something like Kawa
Scheme (for the JVM) as an included JAR file which has a toolkit to generate
JVM bytecode, and so doing Smalltalk->Kawa. See:
  http://www.gnu.org/software/kawa/
"Kawa is: A framework written in Java for implementing high-level and
dynamic languages, compiling them into Java bytecodes. ..."
I could even live with a first cut of Smalltalk->Jython. :-)
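
To give a feel for what a source-to-source first cut might involve, here is a
toy emitter in Python that turns a tiny message-send tree into Python source.
Everything in it is an assumption made for illustration: the tuple-based tree
format, the selector naming scheme ('at:put:' becomes 'at_put_'), and the
Dict stand-in class. The Smalltalk parser that would produce the tree is
omitted:

```python
def selector_to_name(selector):
    # 'at:put:' -> 'at_put_', 'size' -> 'size' (naming scheme is an assumption)
    return selector.replace(":", "_")


def emit(node):
    """Translate a tiny message-send tree into Python source text."""
    kind = node[0]
    if kind == "lit":                      # ("lit", 42)
        return repr(node[1])
    if kind == "var":                      # ("var", "d")
        return node[1]
    if kind == "send":                     # ("send", receiver, selector, args)
        _, receiver, selector, args = node
        rendered_args = ", ".join(emit(a) for a in args)
        return f"{emit(receiver)}.{selector_to_name(selector)}({rendered_args})"
    raise ValueError("unknown node kind: " + str(kind))


class Dict:
    """Stand-in for a Smalltalk Dictionary in the host language."""

    def __init__(self):
        self.items = {}

    def at_put_(self, key, value):
        self.items[key] = value
        return value

    def at_(self, key):
        return self.items[key]


# Smalltalk:  d at: #x put: 3.   d at: #x
store = ("send", ("var", "d"), "at:put:", [("lit", "x"), ("lit", 3)])
fetch = ("send", ("var", "d"), "at:", [("lit", "x")])
store_source = emit(store)
env = {"d": Dict()}
eval(store_source, env)
fetched = eval(emit(fetch), env)
```

The host language's own compiler then does the work of producing bytecode,
which is the appeal of targeting Scala, Kawa or Jython rather than the JVM
directly.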

Then later more stuff can end up in Smalltalk, like generating JVM bytecodes
directly, and maybe performance would improve if it mattered. Part of using
the JVM compared to Squeak's VM is just a difference in philosophy related to
being polylingual instead of monolingual. Use the best JVM tools and
libraries which exist in whatever languages they are in, then if you want,
port those tools to Smalltalk or invent something new. And, these tools will
probably run well on all JVM-supported platforms, whereas right now with
Squeak, you have to think hard about whether a particular C library you
might write or want to use is going to run on multiple platforms.

Still, I'm not interested in *small* embedded systems these days (I used to
be, and I still have some Forth computers with only a few K of memory lying
around. :-) ) I'm more interested in recent desktops or *big* embedded. So
your needs might differ if, say, bundling in a 2MB JAR file for some
intermediate language like Kawa was problematical, or if there was some
other real-time performance issue. But in practice, most everyone has
gigabytes of disk space and likely-as-not now idle CPU cores, so I'm not too
worried about performance or resources for the kind of things Squeak
currently does (GUI, simulation front-end, lightweight web server),
especially if there is the back door of calling speedy Java or Scala code if
there is some bottleneck.

Things that bother me more than resource constraints are questions like, how
can I have a nice Eclipse-like tool written mainly in a Squeak-ish
Smalltalk which lets me *collaboratively* develop and debug code written in
multiple JVM languages and stored in lots of different ways (files, image,
RDF, SQL databases, WebDAV, SVN, Git, etc.) and running across many cores
and many machines?

For me, that's the development environment and demand I expect in the future.

Squeak is a great answer to the older question of, how do I develop code
dynamically myself when it is all written in one language running on a
single core personal computer in a minimal footprint and with the least
external dependencies?

I guess I'm just asking a very different question here. But at the same
time, I want to respect and honor the, say, 90% of Squeak ideals which still
apply when "the network is the computer", and "the developer is the community".

And these are often ideals (like coding in the debugger) which, frankly,
some non-Smalltalkers have a very hard time getting. :-) See:
  "Debugger use"
http://www.cincomsmalltalk.com/blog/blogView?showComments=true&entry=3237275853
  "When your tools suck..."
http://www.cincomsmalltalk.com/blog/blogView?showComments=true&entry=3370104583

Thanks for all your thoughtful comments.

--Paul Fernhout