[squeak-dev] Beyond email, the Social Semantic Desktop

Mon Jan 26 00:50:59 UTC 2009

Ken wrote:
> I agree that for now we should simply move on but perhaps give a little
> thought to what might be a workable alternative if this does turn into a
> real problem in the future.

The recent spam with forged senders to the Squeak list
http://lists.squeakfoundation.org/pipermail/squeak-dev/2009-January/133725.html
suggests public email lists are going to be less and less useful.

I know from one email forged to appear to be from me on one mailing list 
related to Doug Engelbart's Unrev-II colloquium (many years ago), how 
unsettling that can be for the one whose email address gets forged. 
Ultimately such things probably cannot be prevented entirely (security is 
never perfect), but such events can certainly be made less annoying and much 
rarer.

One might ask how Squeak or similar systems could help with that.

Filtering at a single point will never be perfect. One issue with email
(unlike web forums) is that the community can't easily tag information in a
machine-readable format *after* it is sent out, such as "forged" or
even just "boring" or "interesting". One can even imagine tagging email
information after it is sent with a complex set of information like Slashdot 
does, say, related to moderation by various human moderators, then 
interpreted locally in your email client using different moderator 
reputations you yourself or someone you trust supplies. Perhaps you might 
only read emails to the Squeak list that Tim Rowledge marked as VM-related. 
:-) Or, one can imagine better ways of splitting up long emails into a 
variety of topics after they were sent and creating better links between the 
ideas. So, if we have a way to tag information after it is sent out to a 
bunch of people, we can work collaboratively and stigmergically to build 
information into knowledge (or an approximation thereof). This is in some 
sense how the first email-like system by Doug Engelbart worked with Augment, 
although done on only one central machine. There are some commercial systems 
that do support shared distributed workspaces for communications, but we all 
know a dynamic and open source language platform should be able to support 
this better. :-) Wikis like Ward Cunningham invented in Smalltalk already 
work this way in a sense (continued modifications after the first post), but 
they are generally not distributed and wikis are generally restricted to a 
textual approach (although "semantic wikis" are emerging, like Semantic 
MediaWiki).
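
As a rough illustration of the "tag after sending, interpret locally" idea above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tag record layout, the moderator names, the reputation weights, and the threshold are all hypothetical, not part of any existing mail client.

```python
from collections import defaultdict

# Hypothetical tag records that arrive after the original message was sent:
# (message_id, moderator, label)
tags = [
    ("msg-1", "alice", "interesting"),
    ("msg-1", "bob", "boring"),
    ("msg-2", "alice", "forged"),
    ("msg-2", "carol", "forged"),
]

# Local trust weights chosen by the reader (or by someone the reader trusts).
reputation = {"alice": 1.0, "bob": 0.25, "carol": 0.5}


def score_labels(tags, reputation):
    """Sum the reader's trust in each moderator per (message, label)."""
    scores = defaultdict(lambda: defaultdict(float))
    for message_id, moderator, label in tags:
        # Moderators the reader has no opinion about contribute nothing.
        scores[message_id][label] += reputation.get(moderator, 0.0)
    return scores


def visible(message_id, scores, hide_label="forged", threshold=1.0):
    """Hide a message once trusted moderators have flagged it heavily enough."""
    return scores[message_id][hide_label] < threshold


scores = score_labels(tags, reputation)
inbox = [m for m in ("msg-1", "msg-2") if visible(m, scores)]
```

The point of the sketch is only that the same shared tags can produce a different view for each reader, because the reputation table stays local.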

Distributed systems like email are really good approaches for a lot of
reasons, like avoiding the bottleneck that arises when everyone uses the
same server for reading a web forum, or creating redundancy because everyone
has a local copy of each message after it is sent. So, the basic technical
idea of email
is not that bad in terms of being distributed (likewise Usenet was a really 
good idea in some ways to move stuff around in a distributed way). But email 
is showing its age because email is still perceived mostly as being about 
sending free form text, so there is no obvious way to hook tagging into it. 
And of course HTML mail was a step in the wrong direction because it didn't 
address the general problem of sending complex information that people could
easily collaborate to enhance, while at the same time HTML mail made
plain text email more difficult to use. But it still showed one can build 
stuff on top of email, as does the notion of MIME types and attachments. We 
still could use something better, and many agree, but it seems like an 
impossible task to upgrade email and not worth the bother. Backwards 
compatibility is nice, so one can imagine a new system would ideally need to 
support conventional email, perhaps adding extra tagging information inline 
or in attachments. But that all still seems more complicated than it is 
worth. But what if there were other good reasons to switch to a new 
communications platform?

I'd suggest the future may not be so much in reinventing mailing lists and 
email with authentication, but rather more in moving entirely to a new 
multi-purpose distributed paradigm like the "Social Semantic Desktop" (SSD) 
idea:
   http://www.semanticdesktop.org
From there: "The Internet, electronic mail, and the Web have revolutionized
the way we communicate and collaborate - their mass adoption is one of the 
major technological success stories of the 20th century. We all are now much 
more connected, and in turn face new resulting problems: information 
overload caused by insufficient support for information organization and 
collaboration. For example, sending a single file to a mailing list 
multiplies the cognitive processing effort of filtering and organizing this 
file times the number of recipients -- leading to more and more of peoples' 
time going into information filtering and information management activities. 
There is a need for smarter and more fine-grained computer support for 
personal and networked information that has to blend the boundaries between 
personal and group data, while simultaneously safeguarding privacy and 
establishing and deploying trust among collaborators."

That page goes on to provide more details including: "P2P and Grid 
computing, especially in combination with the Semantic Web field, develops 
technology to interconnect large communities without centralized 
infrastructures for data and computation sharing, which is necessary to 
build heterogeneous, multi-organizational collaboration networks."

You could imagine a Social Semantic Desktop from one point of view as being 
like a really smart email client, where you could send out emails to mailing 
lists saying how to tag previous emails. But, one can imagine supporting all 
sorts of backends besides email to let small workgroups communicate in 
various ways. And one can imagine having all sorts of applications running 
on top of this distributed infrastructure (my wife and I have a couple of new
ones in mind ourselves related to manufacturing and storytelling). Squeak does
have distributed applications like Croquet,
   http://en.wikipedia.org/wiki/Croquet_project
but I'm talking about a more common infrastructure for all these ideas. 
Perhaps the TeaTime TObject idea, while great for what it does with Croquet, 
is not the only approach to abstraction for the general problem of 
collaborative work on information?

I have some notions on this myself which I have been pursuing in Jython for 
the JVM:
   http://sourceforge.net/projects/pointrel/
Sorry it's not in Squeak, although earlier versions of that code before the 
SSD focus from years back were for Squeak. Essentially, I've been working on 
a triple store called Pointrel on-and-off for about a quarter century. That 
work even predates WordNet which was in a tiny way perhaps inspired by it. 
Why isn't it done yet? Good question. :-)

I'd suggest that using Squeak as the basis for a similar or better Social 
Semantic Desktop might be the day that Squeak conquers the world. :-) Or at 
least a bigger part of it than it has already with Seaside, Croquet, 
Scratch, etc. :-)

Maybe it could be done using TeaTime, but here is another possibility coming 
at this situation from a different perspective.

Here is a rough idea of what I think is a good architecture for a Social 
Semantic Desktop and which I am working towards right now. Although I'm 
working in Jython to leverage Java, obviously anybody could work on it in
Squeak as friendly co-opetition, and so people are welcome to sign up for 
the Pointrel SSD mailing list on SourceForge and use it as a place to bounce 
around Squeak-related version ideas for now. That list:
http://sourceforge.net/mailarchive/forum.php?forum_name=pointrel-discuss
Obviously, the hope is to replace that list using the system itself. :-)
So, it's just for bootstrapping. :-)

Basic implementation ideas:

* Internally, information is stored in the equivalent of RDF triples.
   http://en.wikipedia.org/wiki/Resource_Description_Framework
I'm using a variant of a triple with a context field as well, like NEPOMUK does:
   http://nepomuk.semanticdesktop.org
(NEPOMUK is an SSD attempt for KDE.) Triples are a general purpose way to
store information, essentially just saying how digital objects link 
together. One might expect there would be ways to place Smalltalk objects 
into sets of triples (or even just strings) and get them back out again. 
Each of the four fields I'm using has both a data field and a namespace
describing how to interpret the data. Using the RDF naming convention, there 
are "subject", "predicate", and "object" fields. (I actually prefer 
"object", "attribute", "value" which are more OO-like.)  The fourth 
"context" field is sort-of equivalent to a file type and file name. These 
triples are defined in the approach I am currently taking using "transactions",
which are sets of triples to add to or remove from the triple store all together.

* Objects are essentially defined by these triples, and their complete 
history is implicit in the list of all transactions which is stored on disk 
like a Smalltalk changes file (although possibly transactions might be 
stored as lots of little files with UUID names until they are periodically 
joined together in a larger file). Storing the entire history of the shared 
database is wasteful in some ways, but disk space is cheap these days; 
likely the whole history will be in memory as well (and that's how I'm doing 
the Jython implementation). For any implementation, the RDF database is 
either entirely in memory (easy for the first try and coding it yourself) or 
on-disk and cached. There are multiple RDF triple-stores that do this 
already on disk with a memory cache in efficient ways, including with query 
languages etc., though not to my knowledge with these distributed 
transaction files like I mention below, though maybe one does?

* Applications like email, shared to-do lists, virtual worlds, shared eToys, 
source code control systems, etc. would then all run on top of this 
distributed system. Not every client might have all transactions, in which 
case their data may have inconsistencies, but applications should be written 
to be forgiving of this. :-) Or, applications that are less forgiving would
require using more exacting backends than email lists (like a central
database or an IRC-like system or TeaTime for real-time low-latency
applications). Essentially, different distributed databases might have 
different preferred coordination backends depending on the need for 
reliability or low-latency or public accessibility. This system could 
potentially replace the Squeak changes file and sources file and various 
Squeak source control systems someday, like say if each change to the system 
was written out as a transaction in an XML file. (I've also implemented an 
OO-like system purely on triples that supported a crude Smalltalk-like 
environment as an experiment a few years back, but that's another issue; for 
now I just see this approach as a flexible shared database which transmits 
the data usually in objects, but I point that proof-of-concept out to show 
how general purpose triples can be.)

* Information could be exchanged as "transactions" that are distributed in 
some way, like via mailing list email attachments, a source code control system
like git, a CGI-based system, a remote database, WebDAV, TeaTime, or whatever
backends one had access to for transmitting these transaction files (or the 
equivalent). I have not implemented this part yet, as the system I have so 
far just uses one big file and for shared use can redirect changes to that 
file through a CGI script which is somewhat like a mailing list archive that 
clients can consult to get all the recent changes they don't have yet. These 
transactions could have digital signatures to make forgeries more difficult 
(though I have not implemented that either). Each transaction is mainly a 
list of triple additions or deletions which modify the triple store (with a 
timestamp and author and licenses granted for each action -- in general the 
system now goes overboard with explicit license granting). Each transaction 
also has a timestamp and the file also specifies which distributed database 
it goes with (via a UUID). (It might make sense to have more than one 
transaction in a file perhaps.) Probably XML is an OK choice for these 
transaction files, which in a sense represent complex messages about how to 
change the state of a triple store. I currently use a different plain text 
format that is easy to read, but I'm thinking of bowing to the inevitable 
standard issue of XML (especially so email attachments could just be an
innocent-looking XML file). It really does not matter much what the
interchanged data looks like, and the local repository could be in a 
different format. Each transaction has a UUID and it is OK to receive the 
same transaction twice as long as it is identical.
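
To make the ideas in the list above a bit more concrete, here is a minimal in-memory Python sketch of a store built from transactions. It is not the Pointrel code: the names Quad, Transaction, and TripleStore, the field layout, and the choice to apply removals before additions are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Quad:
    """A triple plus a context field; each field is a (namespace, data) pair."""
    context: tuple
    subject: tuple
    predicate: tuple
    obj: tuple


@dataclass(frozen=True)
class Transaction:
    database_id: str      # UUID of the distributed database this belongs to
    author: str
    timestamp: str
    adds: tuple = ()      # quads to add
    removes: tuple = ()   # quads to remove
    transaction_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class TripleStore:
    def __init__(self, database_id):
        self.database_id = database_id
        self.history = []   # every transaction applied, like a changes file
        self.seen = {}      # transaction_id -> Transaction
        self.quads = set()  # current state implied by the history

    def apply(self, tx):
        """Apply a transaction; return False for an identical duplicate."""
        if tx.database_id != self.database_id:
            raise ValueError("transaction belongs to a different database")
        previous = self.seen.get(tx.transaction_id)
        if previous is not None:
            if previous != tx:
                raise ValueError("conflicting copies of the same transaction")
            return False
        self.seen[tx.transaction_id] = tx
        self.history.append(tx)
        self.quads.difference_update(tx.removes)  # assumed order: removes first
        self.quads.update(tx.adds)
        return True

    def missing_from(self, other):
        """Transactions this store has that another store lacks."""
        return [tx for tx in self.history if tx.transaction_id not in other.seen]


def text(value):
    return ("text", value)  # hypothetical namespace name


db = str(uuid.uuid4())
q = Quad(text("mail"), text("msg-1"), text("tag"), text("forged"))
tx = Transaction(db, "example-author", "2009-01-26T00:50:59Z", adds=(q,))

local, remote = TripleStore(db), TripleStore(db)
local.apply(tx)
for pending in local.missing_from(remote):  # stand-in for any transport backend
    remote.apply(pending)
```

The loop at the end stands in for whatever backend moves transactions around (email attachments, a CGI script, and so on); receiving the same transaction twice is harmless because apply() ignores an identical duplicate.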

There are some other details, but that is the basic picture.
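
For the XML transaction files mentioned above, one possible shape is sketched below in Python using the standard library's xml.etree.ElementTree. The element and attribute names (transaction, add, remove, namespace, and so on) and the dictionary layout are made up for this example; no such file format has been defined yet.

```python
import xml.etree.ElementTree as ET


def transaction_to_xml(tx):
    """Serialize a transaction dictionary to an XML string (hypothetical format)."""
    root = ET.Element("transaction", {
        "id": tx["id"],
        "database": tx["database"],
        "author": tx["author"],
        "timestamp": tx["timestamp"],
    })
    for action in ("add", "remove"):
        for quad in tx.get(action, []):
            element = ET.SubElement(root, action)
            # Each field carries its namespace as an attribute and its data as text.
            for field_name, (namespace, data) in quad.items():
                child = ET.SubElement(element, field_name, {"namespace": namespace})
                child.text = data
    return ET.tostring(root, encoding="unicode")


def transaction_from_xml(xml_text):
    """Parse the XML produced above back into a transaction dictionary."""
    root = ET.fromstring(xml_text)
    tx = dict(root.attrib)
    for action in ("add", "remove"):
        tx[action] = [
            {child.tag: (child.get("namespace"), child.text or "") for child in element}
            for element in root.findall(action)
        ]
    return tx


example = {
    "id": "tx-0001",
    "database": "db-0001",
    "author": "example-author",
    "timestamp": "2009-01-26T00:50:59Z",
    "add": [{
        "context": ("text", "mail"),
        "subject": ("text", "msg-1"),
        "predicate": ("text", "tag"),
        "object": ("text", "forged"),
    }],
    "remove": [],
}
xml_text = transaction_to_xml(example)
round_trip = transaction_from_xml(xml_text)
```

A file like this could travel as an ordinary email attachment, and a client that does not understand it would simply ignore it.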

Maybe people well-versed in OO design or distributed systems here might have 
better ideas, or even point to existing systems which might be better 
matches for either Squeak or the JVM than what I propose. If so, I'd 
appreciate hearing about them. Here is one somewhat related idea released as 
open source by NASA:
    http://infolab.stanford.edu/~maluf/papers/xdb_ipg_ggf03.pdf
"This paper describes XDB-IPG, an open and extensible database architecture
that supports efficient and flexible integration of heterogeneous and
distributed information resources. XDB-IPG provides a novel “schema-less”
database approach using a document-centered object-relational XML database
mapping. This enables structured, unstructured, and semi-structured
information to be integrated without requiring document schemas or
translation tables. XDB-IPG utilizes existing international protocol
standards of the World Wide Web Consortium Architecture Domain and the
Internet Engineering Task Force, primarily HTTP, XML and WebDAV. Through a
combination of these international protocols, universal database record
identifiers, and physical address data types, XDB-IPG enables an unlimited
number of desktops and distributed information sources to be linked
seamlessly and efficiently into an information grid. XDB-IPG has been used
to create a powerful set of novel information management systems for a
variety of scientific and engineering applications."

Mine is mainly just simpler. :-)

I do waffle sometimes myself about going back to Squeak to do this (as I 
waffle about spending time on getting a Squeak-like system on the JVM). I am 
really impressed by Scratch, for example, as a stand-alone application. If 
there was a motivated group of people interested here in such a system I 
might be tempted to move back to the Squeak side for a time (maybe for 
prototyping and then seeing how it goes after that), especially as the 
Squeak license issues are getting cleaned up, although I'd be very rusty in 
Squeak by now, so it is not my first choice given my own
expertise in Jython. But the core I describe here is not very hard to build 
in any language if you do the brute force approach (everything in memory 
when you want to use a distributed database); what is the big time 
requirement is writing the GUI applications on top of the core (stuff that 
works like an email list, or like Wikipedia or Knol, or like SVN, or like 
any of many other systems for stand-alone work or collaboration). Squeak has 
the rudiments of many of those things, so there is a good argument one could 
build on top of, say, Celeste as an email client, or use Scamper to build a 
distributed wiki system. This sort of project might really be facilitated by 
all the years of hard work people have put into Squeak applications, but 
leveraging all those somewhat copycat Pink Plane efforts in a radically
new Blue Plane distributed database sense. Even the original Augment code 
that has been redone in Squeak could be integrated for fun. In any case, I'd 
be happy to discuss these issues with Squeakers who wanted to build their 
own system even if I stayed with Jython, with an eye to compatibility for 
the files or other protocols used to interchange transactions or handle 
other issues. Although the more I think about the possibilities of 
leveraging all those previous Squeak efforts to build a self-contained 
environment, the more interesting it sounds to do this in Squeak. Still, 
there obviously are free and open source clients to do email and web 
browsing etc. in a lot of languages, so one would expect that after the 
system was defined and usable, that many other people would adopt the 
distributed back end for their own systems (like by writing Thunderbird plugins
or whatever).

I know this all sounds ambitious, but the key idea is that workgroup 
software is already being used, since all it takes to make this useful is a 
few people who want to work together (like with Croquet), and an email 
gateway lets the rest of the world that does not adopt the system still stay 
in touch with the workgroup.

Anyway, I know I'm taking advantage of an unfortunate situation to toot my 
own project's horn, sorry. And I probably would not have sent this email if 
I had not seen that spam, as I'm mostly into Jython right now. Still, 
basically, even regular spam is very annoying as I already pretty much will 
never see email to me that is not one of:
* On a mailing list I have a filter for (but this spam got around that, and 
suggests that approach won't work much longer for public lists),
* Has my name in it (even that is getting iffy since I signed up for some
Google groups, and spammers now somehow connect my name and email),
* Has one of a few other common terms I am interested in (actually the spam 
to squeak-dev triggered a filter I have or I might not have noticed it since
I don't follow this list closely these days), or
* Is on a whitelist of some senders I know and put on there manually (and 
that may fall apart if the spammers improve in terms of their forgeries).
The rest of my recent email sits in a pile of over 40000 unread messages 
(all spam, I hope, though sometimes I search on it and notice something that 
slipped by).  And that is 40000 spam messages even with SpamAssassin on the 
email server filtering out the worst ones. And that's 40000 spams just since 
last I cleaned that file out some months ago. Granted, my email address has 
been on the web for more than ten years.

But in any case, the best reason to do this is actually not to get rid of 
spam. It is to enable people to stigmergically refine knowledge-related 
digital artifacts that people like Doug Engelbart, Vannevar Bush, Ted 
Nelson, and Theodore Sturgeon envisioned decades ago.
   http://en.wikipedia.org/wiki/Stigmergy

Still, I'd suggest this unfortunate incident might be more motivation for 
Squeakers to do something about email in general to the extent technology 
can help, and the above Social Semantic Desktop idea is one approach which 
some Squeakers might be interested in. No doubt spammers will catch up 
eventually, but in the meanwhile things may improve for a time, plus there 
will be the new distributed applications. In the long term, social change to 
a world of abundance for all may be a better solution, of course, at least 
to reduce the motivation for most commercial spam; so in that sense the 
abundance being facilitated by the internet will help the internet defend 
itself from spammers in one way or another. :-)

--Paul Fernhout
http://www.pdfernhout.net/