XML Parser choice (was Re: [ENH] ??? MD5 in Squeak.)

Duane Maxwell dmaxwell at san.rr.com
Thu Nov 29 01:15:49 UTC 2001


Richard A. O'Keefe wrote:
> "Duane Maxwell" <dmaxwell at san.rr.com> wrote:
> The exobox parser is a complete well-formedness, non-validating parser
> minus Unicode support - every obscure little syntax weirdness is handled,
> even if the result is eventually dropped on the floor.
>
> Having written a couple of well-formedness parsers (including one with
> Unicode support), let me plead with the Squeak community *not* to accept
> a non-validating XML parser.
>
> If you DON'T validate, it is DEAD EASY.  There really isn't a lot to do.
> My SAX-style non-validating parser in C literally spends more time in
> the read() system call than everywhere else put together.  When an XML
> parser runs about as fast as wc(1), you know that XML parsing can't be
> hard.  (And yes, every single quirk I know of is handled.)

I agree that a well-formedness parser is relatively easy, which is why there
are so many of them - when I wrote it, however, there weren't any for
Squeak.  On the other hand, I think you can count on one hand all of the
fully validating parsers in *any* language.  It's very tough to implement
everything correctly, and generally unnecessary.  If we were to wait until
such a parser existed under an appropriate Squeak-compatible license in
Smalltalk, we'd never have anything.  By putting something in now that has
the potential of being extended, we at least open the door to handling
XML data even if we let stuff through that might not otherwise survive
validation.

> If you can do semi-validation, you can handle XHTML and DocBook and
> other text formats.

That seems like a reasonable extension to the currently available parsers.

> If you can't do semi-validation, you are *severely* limited in the range
> of XML that you can usefully handle.  One big issue is that people seem
> to be extremely fond of indenting XML; this creates entirely bogus white
> space nodes _unless_ you either (x) like one parser I know, delete _all_
> white space nodes, which gets a fair bit of text wrong, or (y) at least
> semi-validate.

Agreed.  I don't think the exobox parser is correct in this case, and it
should probably be fixed.
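
To make the whitespace issue concrete, here is a minimal sketch in Python
(not Squeak, and not the exobox parser) using the standard library's
non-validating minidom.  The sample documents and the helper names
child_summary and flatten_text are made up for illustration.  It shows the
whitespace-only text nodes that indentation creates, and why option (x),
dropping every whitespace-only node, damages mixed content.

```python
from xml.dom import minidom

# Hypothetical sample documents, chosen only to illustrate the point.
INDENTED = """<list>
  <item>one</item>
  <item>two</item>
</list>"""
MIXED = "<p><b>bold</b> <i>italic</i></p>"


def child_summary(xml_text):
    """Return (kind, value) pairs for the direct children of the root."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary


def flatten_text(node, drop_blank):
    """Concatenate all text under node.

    With drop_blank=True, whitespace-only text nodes are skipped, which is
    the "delete all white space nodes" strategy.
    """
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            if drop_blank and child.data.strip() == "":
                continue
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(flatten_text(child, drop_blank))
    return "".join(parts)


if __name__ == "__main__":
    # Indentation shows up as whitespace-only text nodes between the items.
    for kind, value in child_summary(INDENTED):
        print(kind, repr(value))
    # Dropping blank nodes removes the significant space between the words.
    root = minidom.parseString(MIXED).documentElement
    print(repr(flatten_text(root, drop_blank=False)))
    print(repr(flatten_text(root, drop_blank=True)))
```

Without a DTD the parser has no way to tell which of the two cases it is
looking at, which is the argument for at least semi-validating.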

> We can cross those three grades with some variations:
> (1) Character set.
>     Handle native character set only?
>     Handle ISO Latin 1 only?
>     Handle UTF-8 only?
>     Handle the full range?
>
> My XML parser in C can be compiled to use Latin-1, UTF-8, UCS-2,
> or UCS-4 internally.  The actual parsing code doesn't know and
> doesn't care.  (Yes this means I will accept characters in
> element and attribute names that I shouldn't.  But all that will
> happen is that illegal input will be accepted.  No legal input
> will be rejected, and no legal input will be mis-parsed.)
> The decoding file dwarfs everything else.

I think the limitation with the current Squeak parsers is the lack of
Unicode support in Squeak, and nothing else.  It was a problem I was not
prepared to solve, and it wasn't a deal breaker at the time.  Fortunately,
in the Real World, the amount of XML in some actual encoding other than
7-bit ASCII or ISO8859 is rather low.

Am I correct that your parser requires recompilation to support different
character sets?  Doesn't this require advance knowledge of the encoding of
the XML and therefore limit the usefulness of your solution?  Just curious.
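
For comparison only, and saying nothing about how Richard's C parser
behaves: a minimal Python sketch of picking up the input encoding at run
time from the XML declaration, so that one piece of code reads both Latin-1
and UTF-8 input.  The two sample documents and the helper name element_text
are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The same word, encoded two ways; each document declares its own encoding.
LATIN1_DOC = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\xe9</word>'
UTF8_DOC = b'<?xml version="1.0" encoding="UTF-8"?><word>caf\xc3\xa9</word>'


def element_text(raw_bytes):
    """Parse raw bytes and return the root element's text as a str.

    The underlying parser reads the encoding from the XML declaration, so
    the caller does not need to know it in advance.
    """
    return ET.fromstring(raw_bytes).text


if __name__ == "__main__":
    print(element_text(LATIN1_DOC) == element_text(UTF8_DOC))
```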

> (2) Output method.
>     Generate CXML text?
>     Generate ESIS text?
>     Generate Lisp, Erlang, Prolog text?
>     Have SAX-*like* interface?
>
>     Have exact SAX interface?
>     Have DOM-*like* interface?
>     Have exact DOM interface (not actually possible in C)?
>     Have Lisp, Erlang, Prolog, Smalltalk interface?
>     Have DVM interface?
>     Have JDOM interface?
>
> My parser provides the options above the blank line, all based on
> the event-oriented interface (an idea older than SGML, but nowadays
> credited to SAX).  Again, this code dominates the parsing code in
> size.  A SAX-*like* (but not SAX) interface can be small and very
> easy to use.

I believe that none of those are part of the XML specification - those are
potential representations of the parsed text.
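
A small Python sketch of that distinction, assuming nothing about either
parser under discussion: the parse produces a stream of events, and a tree
is just one representation built on top of them.  The function names
parse_events and events_to_tree and the tuple/dict shapes are invented for
this example; only xml.parsers.expat is a real library.

```python
import xml.parsers.expat


def parse_events(xml_text):
    """Parse xml_text and return a list of SAX-like event tuples."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous character data in one piece
    parser.StartElementHandler = (
        lambda name, attrs: events.append(("start", name, dict(attrs))))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(xml_text, True)
    return events


def events_to_tree(events):
    """Build a nested dict tree (a DOM-like view) from the event list."""
    root = None
    stack = []
    for event in events:
        kind = event[0]
        if kind == "start":
            node = {"name": event[1], "attrs": event[2], "children": []}
            if stack:
                stack[-1]["children"].append(node)
            else:
                root = node
            stack.append(node)
        elif kind == "end":
            stack.pop()
        elif stack:  # character data inside the current element
            stack[-1]["children"].append(event[1])
    return root


if __name__ == "__main__":
    events = parse_events('<a x="1"><b>hi</b></a>')
    print(events)
    print(events_to_tree(events))
```

The same event list could equally be written out as text in some other
notation; the parser proper does not change.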

> (3) Support for compression and encryption?
>     (I leave this lot to your imagination.  I do none of it.)

AFAIK, compression/encryption is not part of the XML specification per se.
Many people do both of these for obvious reasons, but there are no standards
for doing either with XML.  The closest I've seen is the binary XML format
used by the WAP guys to tokenize XML for quick transmission.

> (4) Support for SYSTEM identifiers?
>     No support (document must be self-contained)?
>     SYSTEM identifiers are just local file names?
>     SYSTEM identifiers are (relative) file://localhost filenames,
>       in URL syntax and converted to form required by host OS?
>     SYSTEM identifiers are URIs including support for file:, ftp:, http:?
>     SYSTEM identifiers are URIs including https: and SSH ftp?
>
> The main reason my parser doesn't semi-validate yet is that I've
> never written an FTP or HTTP client before; now that I've discovered
> the CURL library, my main problem may be solved.  Needless to say,
> the CURL library is _enormous_ compared with everything else
> put together.

Yes, I agree that supporting all of that is problematic.  You forgot
supporting proprietary Microsoft protocols and protocols yet to be invented.
That's one of the problems with this section of the XML specification.

> The two operations you need are
> (a) Here is a base URI and another URI which may be relative or
>     absolute; return second resolved relative to the first.
> (b) Here is a URI; return me an input stream reading its content
>     or fail.
>
> (5) Support for PUBLIC identifiers?
>     No support?
>     Support for a handful of known PUBLIC identifiers such as XHTML
>         and especially the XHTML entity sets?
>     Support for PUBLIC identifiers using homebrew catalogues?
>     Support for PUBLIC identifiers using OASIS catalogues?
>
> The last is very desirable if you want to handle DocBook or
> many of the things that XML is useful for, but it is about
> as difficult as well-formedness parsing of XML.
>
> The great thing about having an XML parser in Squeak is that the really
> hard stuff, like URI processing, compression, and encryption, can be
> kept out of your XML parsing code.

Agreed.  The code I wrote was meant to be extended to do all these wonderful
things.  However, the form it is currently in adequately scratched the itch
it was designed to scratch, and development within exobox halted at that
point.  It handled Jabber, RSS, and various other content feeds, and layout
specifications, and myriad other little things.
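
As an illustration of how the two URI operations quoted above, (a) and (b),
can sit outside the parser, here is a rough Python sketch built on the
standard urllib modules.  The function names resolve and open_uri and the
example URIs are hypothetical; which schemes actually work depends on the
handlers the library ships with.

```python
import urllib.parse
import urllib.request


def resolve(base_uri, other_uri):
    """Operation (a): resolve other_uri against base_uri.

    An absolute other_uri is returned unchanged; a relative one is
    interpreted relative to base_uri.
    """
    return urllib.parse.urljoin(base_uri, other_uri)


def open_uri(uri):
    """Operation (b): return a binary input stream for uri, or fail.

    Failure is reported by an exception from urllib (for example
    urllib.error.URLError) rather than by a special return value.
    """
    return urllib.request.urlopen(uri)


if __name__ == "__main__":
    base = "http://example.org/docs/book.xml"  # made-up document location
    print(resolve(base, "chapters/ch1.xml"))
    print(resolve(base, "http://other.example/x.dtd"))
```

A parser that is handed these two operations by its caller never needs to
know about FTP, HTTP, or whatever protocol turns up next.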

> If Squeak is to have an "official" XML parser, let it be one

Please find/write one, and I'd be the first to vote for its inclusion.  I
just think holding one's breath waiting for such an animal to appear fully
formed and functional is a little silly.  In the meantime, there are some
purposes to which any of the proposed parsers could be put pending their
evolution into the uber-parser.  I suspect what would happen is that as
people find needs for the more advanced aspects of XML, they'll make the
necessary changes.  That's the way everything else is in Squeak, so I see no
reason why this would be different.

The arguments you're making could just as easily be applied to Scamper (lack
of perfect table support, no Javascript, no https, no frames, etc), Celeste
(no IMAP, MAPI, WebDAV, etc), or the little Telnet client (no VT100
emulation), etc.  Squeak is rife with half-baked but nonetheless very useful
things.

Cheers -

-- Duane




