XML Parser choice (was Re: [ENH] ??? MD5 in Squeak.)

Wed Nov 28 23:43:00 UTC 2001

"Duane Maxwell" <dmaxwell at san.rr.com> wrote:
	The exobox parser is a complete well-formedness, non-validating parser minus
	Unicode support - every obscure little syntax weirdness is handled, even if
	the result is eventually dropped on the floor.

Having written a couple of well-formedness parsers (including one with
Unicode support), let me plead with the Squeak community *not* to accept
a non-validating XML parser.

If you DON'T validate, it is DEAD EASY.  There really isn't a lot to do.
My SAX-style non-validating parser in C literally spends more time in
the read() system call than everywhere else put together.  When an XML
parser runs about as fast as wc(1), you know that XML parsing can't be
hard.  (And yes, every single quirk I know of is handled.)

There are actually three useful "grades" of parser, only two of which are
covered by the XML 1.0 specification:
    - well-formedness checking (basically, do the brackets balance?)

    - semi-validation: resolve entity references, fill in #DEFAULT and
      #FIXED attribute values, know when an element allows #PCDATA content
      and when it doesn't, so that element-content white-space can be
      reliably discriminated from text in mixed content.

    - full validation: check that the document conforms to the DTD.

If you can do semi-validation, you can handle XHTML and DocBook and
other text formats.

If you can't do semi-validation, you are *severely* limited in the range
of XML that you can usefully handle.  One big issue is that people seem
to be extremely fond of indenting XML.  This creates entirely bogus white
space nodes _unless_ you either (x) delete _all_ white space nodes, as one
parser I know does, which gets a fair bit of text wrong, or (y) at least
semi-validate.  To take a simple example, consider

<example>
<warning>This doesn't seem to be indented but it is.</warning>
<explanation>There are three white space nodes.</explanation>
</example>

Here is the ESIS for that example:

    (example
=>  -\n
    (warning
    -This doesn't seem to be indented but it is.
    )warning
=>  -\n
    (explanation
    -There are three white space nodes.
    )explanation
=>  -\n
    )example

I have flagged the white space nodes.  In SGML, those white space nodes
would simply not exist, DTD or no DTD.  In XML, they do.  And people think
XML is simpler!  With a DTD,
    <!ELEMENT example (warning,explanation*)+>
    <!ELEMENT warning (#PCDATA)>
    <!ELEMENT explanation (#PCDATA)>
a validating or semi-validating XML parser would know to report those
newlines as element-content white-space, which can safely be ignored
and never built into an internal representation in the first place.
You _can't_ get this right without looking at the DTD.
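
A minimal sketch of that difference, in Python rather than the C of the
parser described in this message.  It uses the standard xml.sax module,
which is non-validating, so the set of elements that allow #PCDATA is
supplied by hand here; a semi-validating parser would read it from the
DTD.  The names EsisHandler, esis and pcdata_elements are hypothetical.

```python
import xml.sax

XML_WHITE_SPACE = " \t\r\n"


class EsisHandler(xml.sax.ContentHandler):
    """Collect ESIS-style lines: '(name', '-text', ')name'."""

    def __init__(self, pcdata_elements=None):
        super().__init__()
        # Names of elements whose content model allows #PCDATA.
        # None means "no DTD knowledge": every text node is reported.
        self.pcdata_elements = pcdata_elements
        self.lines = []
        self._stack = []  # names of the currently open elements
        self._text = []   # pending character data (may arrive in chunks)

    def _flush_text(self):
        text = "".join(self._text)
        self._text = []
        if not text:
            return
        parent = self._stack[-1] if self._stack else None
        element_content = (self.pcdata_elements is not None
                           and parent not in self.pcdata_elements)
        if element_content and text.strip(XML_WHITE_SPACE) == "":
            return  # element-content white space: never reported
        escaped = text.replace("\\", "\\\\").replace("\n", "\\n")
        self.lines.append("-" + escaped)

    def startElement(self, name, attrs):
        self._flush_text()
        self.lines.append("(" + name)
        self._stack.append(name)

    def endElement(self, name):
        self._flush_text()
        self._stack.pop()
        self.lines.append(")" + name)

    def characters(self, content):
        self._text.append(content)


def esis(document, pcdata_elements=None):
    """Return the ESIS-style lines for an XML document given as a string."""
    handler = EsisHandler(pcdata_elements)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.lines


DOC = ("<example>\n"
       "<warning>This doesn't seem to be indented but it is.</warning>\n"
       "<explanation>There are three white space nodes.</explanation>\n"
       "</example>")

if __name__ == "__main__":
    print("\n".join(esis(DOC)))                              # no DTD knowledge
    print("\n".join(esis(DOC, {"warning", "explanation"})))  # with it
```

The first call reports the three white space nodes; the second, told
which elements allow #PCDATA, never reports them at all.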

We can cross those three grades with some variations:
(1) Character set.
    Handle native character set only?
    Handle ISO Latin 1 only?
    Handle UTF-8 only?
    Handle the full range?

	My XML parser in C can be compiled to use Latin-1, UTF-8, UCS-2,
	or UCS-4 internally.  The actual parsing code doesn't know and
	doesn't care.  (Yes, this means I will accept characters in
	element and attribute names that I shouldn't.  But all that will
	happen is that illegal input will be accepted.  No legal input
	will be rejected, and no legal input will be mis-parsed.)
	The decoding file dwarfs everything else.

(2) Output method.
    Generate CXML text?
    Generate ESIS text?
    Generate Lisp, Erlang, Prolog text?
    Have SAX-*like* interface?

    Have exact SAX interface?
    Have DOM-*like* interface?
    Have exact DOM interface (not actually possible in C)?
    Have Lisp, Erlang, Prolog, Smalltalk interface?
    Have DVM interface?
    Have JDOM interface?

	My parser provides the options above the blank line, all based on
	the event-oriented interface (an idea older than SGML, but nowadays
	credited to SAX).  Again, this code dominates the parsing code in
	size.  A SAX-*like* (but not SAX) interface can be small and very
	easy to use.

(3) Support for compression and encryption?
    (I leave this lot to your imagination.  I do none of it.)

(4) Support for SYSTEM identifiers?
    No support (document must be self-contained)?
    SYSTEM identifiers are just local file names?
    SYSTEM identifiers are (relative) file://localhost filenames,
      in URL syntax and converted to form required by host OS?
    SYSTEM identifiers are URIs including support for file:, ftp:, http:?
    SYSTEM identifiers are URIs including https: and SSH ftp?

	The main reason my parser doesn't semi-validate yet is that I've
	never written an FTP or HTTP client before; now that I've discovered
	the CURL library, my main problem may be solved.  Needless to say,
	the CURL library is _enormous_ compared with everything else
	put together.

	The two operations you need are
	(a) Here is a base URI and another URI which may be relative or
	    absolute; return second resolved relative to the first.
	(b) Here is a URI; return me an input stream reading its content
	    or fail.

(5) Support for PUBLIC identifiers?
    No support?
    Support for a handful of known PUBLIC identifiers such as XHTML
        and especially the XHTML entity sets?
    Support for PUBLIC identifiers using homebrew catalogues?
    Support for PUBLIC identifiers using OASIS catalogues?

	The last is very desirable if you want to handle DocBook or
	many of the other things that XML is useful for, but it is about
	as difficult as well-formedness parsing of XML.
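
To make the "homebrew catalogue" option under (5) concrete, here is a
rough Python sketch.  The one-entry-per-line format is only loosely
modelled on the PUBLIC entries of OASIS catalogues and handles nothing
else from that format.  The function names and the local file paths are
hypothetical.

```python
import re

# One entry per line:  PUBLIC "public identifier" "system identifier"
_ENTRY = re.compile(r'^\s*PUBLIC\s+"([^"]*)"\s+"([^"]*)"\s*$')


def _normalise(public_id):
    """Collapse runs of white space, as is done when comparing public ids."""
    return " ".join(public_id.split())


def parse_catalogue(text):
    """Map PUBLIC identifiers to system identifiers; ignore other lines."""
    mapping = {}
    for line in text.splitlines():
        match = _ENTRY.match(line)
        if match:
            mapping[_normalise(match.group(1))] = match.group(2)
    return mapping


def resolve_external_id(public_id, system_id, catalogue):
    """Prefer a catalogue entry for the PUBLIC id, else use the SYSTEM id."""
    if public_id is not None:
        key = _normalise(public_id)
        if key in catalogue:
            return catalogue[key]
    return system_id


# Illustrative catalogue; the paths on the right are made up.
CATALOGUE = parse_catalogue('''
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "dtd/xhtml1-strict.dtd"
PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "dtd/xhtml-lat1.ent"
''')
```

A parser would call resolve_external_id with whatever PUBLIC and SYSTEM
identifiers appear in the document, then hand the result to its normal
SYSTEM identifier machinery.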

The great thing about having an XML parser in Squeak is that the really
hard stuff, like URI processing, compression, and encryption, can be
kept out of your XML parsing code.
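
As an illustration of how little the parser itself needs from that
machinery, the two operations listed under (4) can be sketched in Python
with the standard urllib modules.  This is not the CURL-based approach
mentioned above, just a stand-in, and the names resolve_uri and open_uri
are hypothetical.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def resolve_uri(base, reference):
    """(a) Resolve a possibly relative URI against a base URI."""
    return urljoin(base, reference)


def open_uri(uri, timeout=30):
    """(b) Return a binary input stream reading the URI's content.

    Fails by raising an exception: urllib.error.URLError (an OSError)
    when the resource cannot be fetched, ValueError when the string has
    no usable scheme.
    """
    return urlopen(uri, timeout=timeout)


if __name__ == "__main__":
    base = "http://example.org/docs/book.xml"   # made-up document URI
    print(resolve_uri(base, "dtd/book.dtd"))
    print(resolve_uri(base, "file:///usr/share/xml/book.dtd"))
```

Everything behind those two calls (schemes, proxies, TLS) stays on the
far side of the interface, where the XML parsing code never sees it.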

If Squeak is to have an "official" XML parser, let it be one