XML, Squeak, and speed

Fri May 11 03:57:29 UTC 2001

As an exercise in really getting into Squeak, designing my own document
object model (it's hard not to be better than JDOM, let alone DOM) has
been really successful.  I've been strongly motivated to learn lots more
about Squeak, and the more I learn about using the IDE the better I like it.  

I was really REALLY pleased that my non-validating XML parser came to only
274 SLOC of Smalltalk.  (Well, not _that_ pleased, because looking at it,
I can see that it obviously should be shorter.  It's a learning exercise.)

This is not the first XML parser I have written.  It's the second.
My first was a little under a thousand SLOC of C, and is faster than expat.
I want to make that point in order to establish that I know how to make a
non-validating XML parser go screamingly fast.  The parser in Smalltalk has
a very similar structure, but makes much more use of methods, and of course
uses Streams for reading.

As a test case, I took the 'shakespeare.xml' file that comes with the
XMill compressor.  It's actually Antony and Cleopatra.

I really LIKE being able to type in
    d := OxusParser parseFile:
        '/quasar/ustaff/ok/xmill.d/examples/shakespeare.xml'.
    ((d descendants: #SPEAKER)
        collect: [:e | e text asUppercase]) sort grouped
and get output like
    ('AGRIPPA'->29 'ALEXAS'->15 'ALL'->9 'ATTENDANT'->2 'ATTENDANTS'->1
     'CANIDIUS'->10 'CAPTAIN'->1 'CHARMIAN'->63 'CLEOPATRA'->204 'CLOWN'->8
     'DEMETRIUS'->2 'DERCETAS'->5 'DIOMEDES'->7 'DOLABELLA'->23
     'DOMITIUS ENOBARBUS'->113 'EGYPTIAN'->2 'EROS'->27 'EUPHRONIUS'->5
     'FIRST ATTENDANT'->3 'FIRST GUARD'->11 'FIRST SERVANT'->4
     'FIRST SOLDIER'->14 'FOURTH SOLDIER'->3 'GALLUS'->1 'GUARD'->2
     'IRAS'->18 'LEPIDUS'->32 'MARDIAN'->7 'MARK ANTONY'->204 'MECAENAS'->16
     'MENAS'->35 'MENECRATES'->2 'MESSENGER'->42 'OCTAVIA'->13
     'OCTAVIUS CAESAR'->98 'PHILO'->2 'POMPEY'->41 'PROCULEIUS'->10
     'SCARUS'->12 'SECOND ATTENDANT'->1 'SECOND GUARD'->4
     'SECOND MESSENGER'->2 'SECOND SERVANT'->3 'SECOND SOLDIER'->11
     'SELEUCUS'->3 'SILIUS'->3 'SOLDIER'->13 'SOOTHSAYER'->14 'TAURUS'->1
     'THIRD GUARD'->1 'THIRD SOLDIER'->10 'THYREUS'->12 'VARRIUS'->1
     'VENTIDIUS'->4 )

I mean, who needs a query language or XPath when you've got Smalltalk?
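
The definitions of sort and grouped are not shown here.  As a rough sketch
of the same counting idea, using only classes I believe are in a stock
Squeak image (Bag, Set, SortedCollection) and a made-up list of names
standing in for the SPEAKER texts:

    | names counts |
    "Hypothetical sample data standing in for the SPEAKER texts."
    names := #('CLEOPATRA' 'MARK ANTONY' 'CLEOPATRA' 'EROS').
    counts := Bag new.
    names do: [:each | counts add: each].
    "One name->count Association per distinct name, in sorted order."
    counts asSet asSortedCollection asArray
        collect: [:name | name -> (counts occurrencesOf: name)]

Printing the last expression should show CLEOPATRA with a count of 2 and
the other two names with a count of 1 each.  This is not a claim about how
grouped works internally, only one way to get the same shape of answer.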

However, I pay a price for this ease of manipulation: speed.
In this test, the parsers that can validate didn't, because the file
does not contain a doctype.  For the C parsers, I made a new file
containing N copies of the test file, and divided the actual time by N.
(N=50 for my parser, N=20 for the others.)  The C parsers all wrote ESIS
output, directed to /dev/null.
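
The same divide-by-N idea can be tried inside Squeak.  This is only a
sketch, not a record of how the Squeak figure below was measured: the
workload block is a stand-in, n is an arbitrary choice, and it assumes
Time millisecondsToRun: is available in the image.

    | n workload ms |
    n := 20.
    "Stand-in workload; substitute the parseFile: call shown earlier."
    workload := [(1 to: 10000) inject: 0 into: [:sum :i | sum + i]].
    "Total elapsed milliseconds for n runs of the workload."
    ms := Time millisecondsToRun: [n timesRepeat: [workload value]].
    "Average seconds per run."
    ms / n / 1000.0

Repeating the work n times and averaging matters most when a single run is
short compared with the resolution of the millisecond clock.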

Time      Parser
18.9  s   my XML parser in Squeak 2.7, can't validate, does build tree
 1.55 s   nsgmls from SP 1.3, can validate, does it build a tree?
 0.90 s   Jan Wielemaker's XML parser, can validate, does build tree
 0.24 s   my XML parser in C, can't validate, can build tree but didn't

That's a speed ratio of about 80 between my two parsers.
Now, my parser in C didn't build a tree in this test, and it bypasses
the stdio library in the interests of speed.  Each data character, for
example, touches memory just four times.

Jan Wielemaker's parser is portable C code (well, portable to any GCC;
it took a bit of tweaking to make it compile under MacOS).  It doesn't
get up to hackish tricks.  So it makes a fairer comparison.  The ratio
for that is
    time for simple parser in Smalltalk
    ----------------------------------- = 21
    time for complex parser in C

Considering that Squeak is running a byte-coded VM, that's really not too bad.
I once estimated a factor of 5 for threaded code vs C, and about a factor of
2 for byte code vs threaded code, which multiplies out to about 10 for byte
code vs C.  C compilers have got rather better since then, so a ratio of 21
is about what you would expect.
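
For anyone who wants to check the arithmetic, the ratios can be recomputed
in a workspace.  The variable names are made up for this sketch; the times
are the ones from the table, written in hundredths of a second so that the
arithmetic stays exact.

    | squeak nsgmls wielemaker mine |
    "Times from the table, in hundredths of a second."
    squeak := 1890.
    nsgmls := 155.
    wielemaker := 90.
    mine := 24.
    (squeak / mine) rounded.        "79, i.e. 'about 80'"
    (squeak / wielemaker) rounded.  "21"
    (squeak / nsgmls) rounded.      "12"
    5 * 2                           "10, the old byte code vs C estimate"

So the measured 21 is roughly twice the old estimate of 10, which is the gap
being put down to better C compilers.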

Considering that there are affordable machines on the market that are about
4 times as fast as the one I used for the test (might be more, fpversion is
telling me a clock rate I don't believe), this is already fast *enough* to
be usable (most of my files being rather smaller than Antony and Cleopatra!).

I'm pleased at being able to use "explore" to examine XML documents.
It lacks some end-user features that Xeena has (but then, the first time
I tried Xeena, it puked a Java stack dump all over the screen, so that
was the last time I tried Xeena), but it has programmer features I like
better.

If only I understood Text, Paragraphs, and ParagraphEditors well enough,
so that I could render XML documents on-screen and edit them...

Is anyone porting the Jitter to SPARC?