Re: XML Parser choice (was Re: [ENH] ??? MD5 in Squeak.)

29 Nov 2001


      "Duane Maxwell" dmaxwell@san.rr.com wrote:
    The exobox parser is a complete well-formedness, non-validating parser minus
    Unicode support - every obscure little syntax weirdness is handled, even if
    the result is eventually dropped on the floor.
Having written a couple of well-formedness parsers (including one with
Unicode support), let me plead with the Squeak community *not* to accept
a non-validating XML parser.
If you DON'T validate, it is DEAD EASY.  There really isn't a lot to do.
My SAX-style non-validating parser in C literally spends more time in
the read() system call than everywhere else put together.  When an XML
parser runs about as fast as wc(1), you know that XML parsing can't be
hard.  (And yes, every single quirk I know of is handled.)
There are actually three useful "grades" of parser, only two of which are
covered by the XML 1.0 specification:
    - well-formedness checking (basically, do the brackets balance?)
    - semi-validation: resolve entity references, fill in default and
      #FIXED attribute values, know when an element allows #PCDATA content
      and when it doesn't, so that element-content white-space can be
      reliably discriminated from text in mixed content.
    - full validation: check that the document conforms to the DTD.
If you can do semi-validation, you can handle XHTML and DocBook and
other text formats.
If you can't do semi-validation, you are *severely* limited in the range
of XML that you can usefully handle.  One big issue is that people seem
to be extremely fond of indenting XML; this creates entirely bogus white
space nodes _unless_ you either (x) like one parser I know, delete _all_
white space nodes, which gets a fair bit of text wrong, or (y) at least
semi-validate.  To take a simple example, consider
<example>
<warning>This doesn't seem to be indented but it is.</warning>
<explanation>There are three white space nodes.</explanation>
</example>
Here is the ESIS for that example:
(example
=>  -\n
    (warning
    -This doesn't seem to be indented but it is.
    )warning
=>  -\n
    (explanation
    -There are three white space nodes.
    )explanation
=>  -\n
    )example
I have flagged the white space nodes.  In SGML, those white space nodes
would simply not exist, DTD or no DTD.  In XML, they do.  And people think
XML is simpler!  With a DTD,
    <!ELEMENT example (warning,explanation*)+>
    <!ELEMENT warning (#PCDATA)>
    <!ELEMENT explanation (#PCDATA)>
a validating or semi-validating XML parser would know to report those
newlines as element-content white-space, which can safely be ignored
and never built into an internal representation in the first place.
You _can't_ get this right without looking at the DTD.
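To make the white space point concrete, here is a minimal Python sketch using the standard xml.dom.minidom, which does not validate.  It is not the parser discussed in this thread.  ELEMENT_ONLY and both function names are assumptions for illustration: the set is a hand-written stand-in for what a semi-validating parser would learn from the DTD above.

```python
from xml.dom import minidom

DOC = (
    "<example>\n"
    "<warning>This doesn't seem to be indented but it is.</warning>\n"
    "<explanation>There are three white space nodes.</explanation>\n"
    "</example>"
)

# Hypothetical: names of elements whose content model contains no #PCDATA.
# A semi-validating parser would derive this from the DTD instead.
ELEMENT_ONLY = {"example"}

XML_SPACE = " \t\r\n"  # the only characters XML 1.0 treats as white space


def whitespace_children(element):
    """Return the direct text children that consist only of white space."""
    return [n for n in element.childNodes
            if n.nodeType == n.TEXT_NODE and n.data.strip(XML_SPACE) == ""]


def drop_element_content_whitespace(element, element_only):
    """Remove white space text nodes, but only inside element-only content."""
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            drop_element_content_whitespace(child, element_only)
        elif (child.nodeType == child.TEXT_NODE
              and element.tagName in element_only
              and child.data.strip(XML_SPACE) == ""):
            element.removeChild(child)


root = minidom.parseString(DOC).documentElement
before = len(whitespace_children(root))
drop_element_content_whitespace(root, ELEMENT_ONLY)
after = len(whitespace_children(root))
print(before, after)  # prints: 3 0
```

Without ELEMENT_ONLY the only choices are to keep all three nodes or to delete white space everywhere, which is exactly the dilemma described above.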
We can cross those three grades with some variations:
(1) Character set.
    Handle native character set only?
    Handle ISO Latin 1 only?
    Handle UTF-8 only?
    Handle the full range?
    My XML parser in C can be compiled to use Latin-1, UTF-8, UCS-2,
    or UCS-4 internally.  The actual parsing code doesn't know and
    doesn't care.  (Yes this means I will accept characters in
    element and attribute names that I shouldn't.  But all that will
    happen is that illegal input will be accepted.  No legal input
    will be rejected, and no legal input will be mis-parsed.)
    The decoding file dwarfs everything else.
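The separation between decoding and parsing can be sketched in Python as follows.  This is only an illustration of the idea, not the C parser described above: decode_document, element_names, and the BOM table are assumed names, and the "parser" is a toy regular expression that stands in for real parsing code.  The point is that it sees decoded characters only and never learns which encoding the bytes arrived in.

```python
import codecs
import re

# UTF-32 is checked first because the UTF-32 LE BOM begins with the
# UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_document(raw, encoding=None):
    """Turn bytes into text; sniff a BOM unless the caller names an encoding.

    With no BOM and no explicit encoding, assume UTF-8.  A real parser
    would also read the encoding declaration; that is omitted here.
    """
    if encoding is None:
        encoding = "utf-8"
        for bom, name in _BOMS:
            if raw.startswith(bom):
                encoding = name
                break
    return raw.decode(encoding)


def element_names(text):
    """Toy stand-in for the parser: list the names in start tags."""
    return re.findall(r"<([^\s/!?<>]+)", text)


SAMPLE = "<résumé lang='fr'>café</résumé>"
variants = [
    decode_document(SAMPLE.encode("utf-8")),
    decode_document(SAMPLE.encode("utf-8-sig")),
    decode_document(SAMPLE.encode("utf-16")),
    decode_document(SAMPLE.encode("utf-32")),
    decode_document(SAMPLE.encode("latin-1"), encoding="latin-1"),
]
```

Every entry in variants is the same string, so element_names gives the same answer for all five inputs.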
(2) Output method.
    Generate CXML text?
    Generate ESIS text?
    Generate Lisp, Erlang, Prolog text?
    Have SAX-*like* interface?

    Have exact SAX interface?
    Have DOM-*like* interface?
    Have exact DOM interface (not actually possible in C)?
    Have Lisp, Erlang, Prolog, Smalltalk interface?
    Have DVM interface?
    Have JDOM interface?
    My parser provides the options above the blank line, all based on
    the event-oriented interface (an idea older than SGML, but nowadays
    credited to SAX).  Again, this code dominates the parsing code in
    size.  A SAX-*like* (but not SAX) interface can be small and very
    easy to use.
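As an illustration of how little code sits between an event-oriented interface and ESIS text, here is a sketch using Python's standard xml.sax module.  EsisHandler, esis_escape, and to_esis are assumed names, the attribute lines are simplified (every attribute is reported as CDATA), and this is not the C parser described above.

```python
import xml.sax


def esis_escape(text):
    """Escape backslashes and newlines the way ESIS data lines expect."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


class EsisHandler(xml.sax.ContentHandler):
    """Collect ESIS-style lines from SAX events."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._text = []  # character data may arrive in several chunks

    def _flush(self):
        if self._text:
            self.lines.append("-" + esis_escape("".join(self._text)))
            self._text = []

    def startElement(self, name, attrs):
        self._flush()
        for key, value in attrs.items():
            self.lines.append("A%s CDATA %s" % (key, esis_escape(value)))
        self.lines.append("(" + name)

    def endElement(self, name):
        self._flush()
        self.lines.append(")" + name)

    def characters(self, content):
        self._text.append(content)


def to_esis(xml_bytes):
    """Parse a document given as bytes and return its ESIS lines."""
    handler = EsisHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lines


DOC = (
    b"<example>\n"
    b"<warning>This doesn't seem to be indented but it is.</warning>\n"
    b"<explanation>There are three white space nodes.</explanation>\n"
    b"</example>"
)
print("\n".join(to_esis(DOC)))
```

Run on the example from earlier in this message, it reports the three newline-only data lines, because the underlying parser is not looking at any DTD.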
(3) Support for compression and encryption?
    (I leave this lot to your imagination.  I do none of it.)
(4) Support for SYSTEM identifiers?
    No support (document must be self-contained)?
    SYSTEM identifiers are just local file names?
    SYSTEM identifiers are (relative) file://localhost filenames,
      in URL syntax and converted to form required by host OS?
    SYSTEM identifiers are URIs including support for file:, ftp:, http:?
    SYSTEM identifiers are URIs including https: and SSH ftp?
    The main reason my parser doesn't semi-validate yet is that I've
    never written an FTP or HTTP client before; now that I've discovered
    the CURL library, my main problem may be solved.  Needless to say,
    the CURL library is _enormous_ compared with everything else
    put together.
    The two operations you need are
    (a) Here is a base URI and another URI which may be relative or
        absolute; return second resolved relative to the first.
    (b) Here is a URI; return me an input stream reading its content
        or fail.
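In a language with a URL library to hand, operations (a) and (b) are each about one line.  The following is a sketch in Python using only the standard urllib; resolve_uri and open_uri are assumed names, and the timeout value is an arbitrary choice.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def resolve_uri(base, ref):
    """(a) Resolve ref, which may be relative or absolute, against base."""
    return urljoin(base, ref)


def open_uri(uri, timeout=10.0):
    """(b) Return a binary input stream for uri, or raise on failure.

    urlopen handles file:, ftp:, http: and https: URLs; failures surface
    as exceptions (urllib.error.URLError, or ValueError for a URL with no
    recognised scheme).
    """
    return urlopen(uri, timeout=timeout)


# The example.org address is a placeholder, not a real DTD location.
dtd = "http://example.org/dtd/book.dtd"
print(resolve_uri(dtd, "ents/chars.ent"))
```

An external entity's SYSTEM identifier would be resolved with (a) against the URI of the entity that contains the declaration, and the result handed to (b).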
(5) Support for PUBLIC identifiers?
    No support?
    Support for a handful of known PUBLIC identifiers such as XHTML
        and especially the XHTML entity sets?
    Support for PUBLIC identifiers using homebrew catalogues?
    Support for PUBLIC identifiers using OASIS catalogues?
    The last is very desirable if you want to handle DocBook or
    many of the things that XML is useful for, but it is about
    as difficult as well-formedness parsing of XML.
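The "homebrew catalogue" level can be very small.  Here is a sketch in Python: the two PUBLIC identifiers are the real XHTML 1.0 ones, but the local file paths, the CATALOGUE name, and both function names are assumptions for illustration.  OASIS catalogue files are not handled.

```python
from urllib.parse import urljoin

# Hypothetical homebrew catalogue: PUBLIC identifier -> local copy.
# The file: paths are made up; point them at wherever the files live.
CATALOGUE = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        "file:///usr/local/share/xml/xhtml1-strict.dtd",
    "-//W3C//ENTITIES Latin 1 for XHTML//EN":
        "file:///usr/local/share/xml/xhtml-lat1.ent",
}


def normalize_public_id(public_id):
    """Collapse runs of white space and trim, as XML 1.0 requires
    before PUBLIC identifiers are compared."""
    return " ".join(public_id.split())


def resolve_external(public_id, system_id, base_uri, catalogue=CATALOGUE):
    """Prefer a catalogue entry; otherwise fall back to the SYSTEM id."""
    if public_id is not None:
        hit = catalogue.get(normalize_public_id(public_id))
        if hit is not None:
            return hit
    return urljoin(base_uri, system_id)


location = resolve_external(
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    "http://example.org/doc.xml",
)
```

Here location is the local copy, so the XHTML DTD and its entity sets can be read without any network access.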
The great thing about having an XML parser in Squeak is that the really
hard stuff, like URI processing, compression, and encryption, can be
kept out of your XML parsing code.
If Squeak is to have an "official" XML parser, let it be one