XML Parser choice (was Re: [ENH] ??? MD5 in Squeak.)

Thu Nov 29 04:05:37 UTC 2001

"Duane Maxwell" <dmaxwell at san.rr.com> wrote:
	Am I correct that your parser requires recompilation to support
	different character sets?

No.  There are four internal encodings:
    LAT1	ISO Latin 1 (or ASCII)
    UTF8	
    UCS2
    UCS4
The *internal* encoding must be selected at compile time.
There are several external encodings:
    ISO Latin 1
    other 8-bit ASCII extension (table-driven)
    UTF8
    UCS2-big-endian
    UCS2-little-endian
    UCS4-big-endian
    UCS4-little-endian
The *external* encoding is selected at run time.
The hack described in the XML 1.0 specification lets you figure out
UCS2-BE, UCS2-LE, UCS4-BE, UCS4-LE, or some 8-bit code, because the
first character of an XML document must be space, tab, carriage
return, line feed, or '<' (or possibly a byte order mark; I've tried
to determine what Unicode 3.0 + XML 1.0 together say about that, but
I'm not at all sure).  For figuring out which 8-bit code you
have, XML 1.0 says you _have_ to handle UTF8, and _don't_ have to
handle anything else, so the hack could suffice.  But that's why we
have XML declarations.  So you have to start on the assumption that it's
UTF-8 (no, starting on the assumption that it's Latin-1 is *not*
allowed by the XML 1.0 specification) and then switch when you find the
XML declaration.  With 4*7 = 28 combinations, you can see that there's
a lot of code; fortunately some cases are easy, so it's not as bad as
it sounds.  In order to get good performance, the recoding is done a
block at a time.  Unfortunately, that means that I may have to switch
from UTF8 to some other 8-bit code partway through a block, and that
was *tough* to get right.
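
To make the auto-detection step concrete, here is a minimal sketch in
Python (not the C of the parser discussed above).  It only looks at the
first four bytes and sorts the input into one of the five families named
above.  The function name and the returned labels are made up for
illustration, and byte order marks are deliberately ignored.

```python
# Characters that may begin an XML document, per the paragraph above.
FIRST_CHARS = b" \t\r\n<"


def sniff_encoding_family(head: bytes) -> str:
    """Guess the encoding family from the first four bytes of a document.

    Returns 'UCS4-BE', 'UCS4-LE', 'UCS2-BE', 'UCS2-LE', or '8-bit'.
    Hypothetical helper; byte order marks are not handled in this sketch.
    """
    b = head[:4]
    if len(b) == 4:
        # Three zero bytes around one of the legal first characters.
        if b[:3] == b"\x00\x00\x00" and b[3] in FIRST_CHARS:
            return "UCS4-BE"
        if b[1:] == b"\x00\x00\x00" and b[0] in FIRST_CHARS:
            return "UCS4-LE"
    if len(b) >= 2:
        # One zero byte next to one of the legal first characters.
        if b[0] == 0 and b[1] in FIRST_CHARS:
            return "UCS2-BE"
        if b[1] == 0 and b[0] in FIRST_CHARS:
            return "UCS2-LE"
    # Otherwise assume an 8-bit code and start reading it as UTF-8
    # until an XML declaration says otherwise.
    return "8-bit"
```

The UCS4 tests come first because a little-endian UCS4 '<' would
otherwise also pass the UCS2-LE test.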

There are some nasty little quirks in XML.  For example, what if you
auto-detect that the input is UCS2, but then the XML declaration says
the encoding is UTF8?  Or you auto-detect that it's an 8-bit code, but
then the XML declaration says it's UCS2?  On the assumption that such
a clash could come about if a document was recoded by another program
that didn't know or care about XML declarations, I decided to just
write a warning message and keep going.
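
A rough sketch of that "warn and keep going" policy, again in Python
for brevity.  The table of encoding names is illustrative and far from
complete, the function name is hypothetical, and the choice to keep
trusting the auto-detected family after a clash is my assumption about
what "keep going" should mean (the bytes are more believable than a
stale declaration).

```python
import warnings

# Illustrative mapping from declared encoding names to detection families.
FAMILY_OF = {
    "UTF-8": "8-bit",
    "ISO-8859-1": "8-bit",
    "US-ASCII": "8-bit",
    "ISO-10646-UCS-2": "UCS2",
    "ISO-10646-UCS-4": "UCS4",
}


def reconcile(detected: str, declared: str) -> str:
    """Pick the encoding to continue with after reading the XML declaration.

    `detected` is 'UCS2-BE', 'UCS2-LE', 'UCS4-BE', 'UCS4-LE', or '8-bit'.
    `declared` is the name found in the XML declaration.
    """
    name = declared.upper()
    detected_base = detected[:4] if detected.startswith("UCS") else detected
    # Unknown names are treated as table-driven 8-bit codes (an assumption).
    declared_base = FAMILY_OF.get(name, "8-bit")
    if declared_base != detected_base:
        # Clash: complain, then carry on with what the bytes told us.
        keep = "UTF-8" if detected == "8-bit" else detected
        warnings.warn(
            f"encoding declaration {declared!r} conflicts with "
            f"auto-detected {detected}; continuing with {keep}"
        )
        return keep
    # No clash: an 8-bit input switches to the declared code, while a
    # wide input already has its byte order fixed by detection.
    return name if detected == "8-bit" else detected
```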

non-white-space characters

	> (2) Output method.
	>     Generate CXML text?
	>     Generate ESIS text?
	>     Generate Lisp, Erlang, Prolog text?
	>     Have SAX-*like* interface?
	>
	>     Have exact SAX interface?
	>     Have DOM-*like* interface?
	>     Have exact DOM interface (not actually possible in C)?
	>     Have Lisp, Erlang, Prolog, Smalltalk interface?
	>     Have DVM interface?
	>     Have JDOM interface?
	>
	> My parser provides the options above the blank line, all based on
	> the event-oriented interface (an idea older than SGML, but nowadays
	> credited to SAX).  Again, this code dominates the parsing code in
	> size.  A SAX-*like* (but not SAX) interface can be small and very
	> easy to use.

	I believe that none of those are part of the XML specification - those are
	potential representations of the parsed text.

Indeed.  But *some* representation must be provided.  And the DOM is a
W3C recommendation in the same way that XML itself is.
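
To show how little an event-oriented interface needs, here is a small
sketch that turns parser events into ESIS-style text.  It uses the
expat binding in Python's standard library rather than the parser
discussed in this thread.  The function name is hypothetical, every
attribute is reported as CDATA, and the escaping is simplified.

```python
import xml.parsers.expat


def esis_lines(document: bytes):
    """Parse `document` and return ESIS-style lines, one per event."""
    out = []

    def start(name, attrs):
        # Attributes are listed before the start-tag line.
        for key, value in attrs.items():
            out.append(f"A{key} CDATA {value}")
        out.append(f"({name}")

    def end(name):
        out.append(f"){name}")

    def text(data):
        # Escape backslashes and newlines so each event stays on one line.
        out.append("-" + data.replace("\\", "\\\\").replace("\n", "\\n"))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous text as a single event
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.Parse(document, True)
    return out
```

Three callbacks are the whole interface; anything tree-shaped can be
built on top of them by whoever wants a tree.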

See, there are lots of other variations.

    - Do you support the XML Base recommendation or not?
      [Not in XML 1.0, but W3C think it's part of the Core.]

    - Do you support Namespaces or not?
      [Not in XML 1.0, but W3C think it's part of the Core.]

    - Do you support XML Include or not?
      [Not in XML 1.0, but W3C think it's part of the Core.]

I have been personally assured, via E-mail, by some W3C people, that
they don't regard any parser that doesn't support all of these things
as a "real" XML parser, and if most of the parsers I had ready access
to didn't support them, that was just my tough luck.

	> (3) Support for compression and encryption?
	>     (I leave this lot to your imagination.  I do none of it.)

	AFAIK, compression/encryption is not part of the XML specification per se.

But they ARE part of data transmission.  If we're talking about sending
business or private information over the Internet, then some form of
crypto support had _better_ be there, even if it's only via SSL or SSH.

	Many people do both of these for obvious reasons, but there are
	no standards with regard to XML.  The closest I've seen is the
	binary XML format used by the WAP guys to tokenize XML for quick
	transmission.

I've got a WBXML decompressor/parser too.  Sort of fun.  What I don't
have is a WBXML compressor.  Modulo weak support for entities and notations,
it makes sense for a lot of SGML-style data, not just WML.

	> (4) Support for SYSTEM identifiers?

	Yes, I agree that supporting all of that is problematic.  You forgot
	supporting proprietary Microsoft protocols and protocols yet to be invented.
	That's one of the problems with this section of the XML specification.

Well, a tool that doesn't and can't support any proprietary Microsoft
protocols gets a nice fat tick in the "security features" box on my
checklist.  ftp:, http:, and https: are the ones we're mainly seeing in
(X)HTML, and it would be nice if Scamper could use Squeak's XML parser.

	Agreed.  The code I wrote was meant to be extended to do all
	these wonderful things.  However, the form it is currently in
	adequately scratched the itch it was designed to scratch, and
	development within exobox halted at that point.  It handled
	Jabber, RSS, and various other content feeds, and layout
	specifications, and myriad other little things.

The problem, of course, is money.

	> If Squeak is to have an "official" XML parser, let it be one

	Please find/write one, and I'd be the first to vote for its
	inclusion.

Well, I have a perfectly workable solution for UNIX.  It should be OK for
Windows and MacOS X as well.  Grab an existing XML parser in C, such as
the one in SWI Prolog (only handles Latin 1, only handles files, not
http: or ftp: URIs, does pretty much everything else) or ltxml-1.1
(free from the University of Edinburgh, subject to terms of use; see
http://www.ltg.ed.ac.uk/software/xml/ for details; this one does pretty
much everything in XML 1.0), or even take libxml and wrap some kind of
main() around it.
To avoid any licence issues, run the parser in another process.
Pipe the information to it, and collect the events coming back.
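
The plumbing for that is short.  The sketch below is in Python rather
than Squeak, and to stay self-contained it uses a child Python process
with the standard expat binding as a stand-in for the external C
parser.  The event format (one line per event) and all the names are
invented for illustration.

```python
import subprocess
import sys

# Stand-in for the external parser: reads XML on stdin and writes one
# event per line on stdout.  In the arrangement described above this
# would be a C program wrapped around an existing parser.
CHILD_SOURCE = r'''
import sys, xml.parsers.expat
p = xml.parsers.expat.ParserCreate()
p.buffer_text = True
p.StartElementHandler = lambda name, attrs: print("(" + name)
p.EndElementHandler = lambda name: print(")" + name)
p.CharacterDataHandler = lambda data: print("-" + data.replace("\n", "\\n"))
p.Parse(sys.stdin.buffer.read(), True)
'''


def parse_out_of_process(document: bytes):
    """Pipe `document` to the child parser and return its event lines."""
    result = subprocess.run(
        [sys.executable, "-X", "utf8", "-c", CHILD_SOURCE],
        input=document,       # sent to the child's stdin
        capture_output=True,  # collect the events from its stdout
        check=True,           # raise if the child reports a parse error
    )
    return result.stdout.decode("utf-8").splitlines()
```

The caller never links against the parser; it only sees the stream of
event lines coming back through the pipe.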

	I just think holding one's breath waiting for such an animal to
	appear fully formed and functional is a little silly.

Not so.  If we don't insist on rewriting everything, we can have it
right now.  Not as efficient as we'd like; not usable in a Squeak-NOS
environment, but much more capable than anything we've got in Squeak
itself.

	In the meantime, there are some purposes to which any of
	the proposed parsers could be put pending their evolution into
	the uber-parser.

Indeed.

	I suspect what would happen is that as people
	find needs for the more advanced aspects of XML, they'll make
	the necessary changes.

Um, that's not actually _easy_.  I am slowly rewriting my XML parser to
do semi-validation (slowly because I have a lot of other things on my
plate) and it's rather harder than well-formedness parsing.  (I don't
even want to think about Schemas.)  There isn't much you can do without
doing quite a bit.

	That's the way everything else is in
	Squeak, so I see no reason why this would be different.

	The arguments you're making could just as easily be applied to
	Scamper (lack of perfect table support, no Javascript, no https,
	no frames, etc), Celeste (no IMAP, MAPI, WebDAV, etc), or the
	little Telnet client (no VT100 emulation), etc.  Squeak is rife
	with half-baked but nonetheless very useful things.

Somehow it always seems to be the bit _I_ need that doesn't quite work
or isn't quite written yet.  If we all had the same itches to scratch,
we could all use the exobox parser and be happy.  We don't.

If Netscape and IE can't handle tables
perfectly (no, shoving content off the edge of the window is _not_
perfect handling, there really REALLY needs to be an option to tell
the browser 'if you see widths in pixels, add 'em up and convert 'em
to percentages because the drongo that wrote the page assumed everyone
had 72 dpi and US paper and I have neither') I don't think Scamper needs
to.  The absence of Javascript doesn't stop me using Amaya and liking it
very well; indeed that very absence gets another fat tick in the "security
features" box.  Https is a limit.  As for Celeste, that's not all it's
missing, which is why I can't use it on my UNIX box.  I didn't know there
was a Telnet client; I must try it out.  Thanks for the info.

My students have often run into Squeak's rough edges.
In all fairness to Squeak, I must say that there are a LOT of half-baked
XML parsers out there, and it sometimes seems that the entire XML world
is half-baked or worse.

Is there a compromise position?

How about a rough-and-ready not-quite-XML parser written in Squeak
with nice data structures and interfaces,
AND an OSProcess-based wrapper around libxml
with the same data structures and interfaces?