"Duane Maxwell" dmaxwell@san.rr.com wrote: The exobox parser is a complete well-formedness, non-validating parser minus Unicode support - every obscure little syntax weirdness is handled, even if the result is eventually dropped on the floor.
Having written a couple of well-formedness parsers (including one with Unicode support), let me plead with the Squeak community *not* to accept a non-validating XML parser.
If you DON'T validate, it is DEAD EASY. There really isn't a lot to do. My SAX-style non-validating parser in C literally spends more time in the read() system call than everywhere else put together. When an XML parser runs about as fast as wc(1), you know that XML parsing can't be hard. (And yes, every single quirk I know of is handled.)
There are actually three useful "grades" of parser, only two of which are covered by the XML 1.0 specification:
- well-formedness checking (basically, do the brackets balance?)
- semi-validation: resolve entity references, fill in default and #FIXED attribute values, know when an element allows #PCDATA content and when it doesn't, so that element-content white-space can be reliably discriminated from text in mixed content.
- full validation: check that the document conforms to the DTD.
If you can do semi-validation, you can handle XHTML and DocBook and other text formats.
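To make "resolve entity references, fill in attribute defaults" concrete, here is a small sketch using Python's standard expat binding. Expat is a non-validating parser, but it does read the internal DTD subset, so it shows the effect on a self-contained document. The document text and the function name parse_events are made up for illustration; nothing here fetches an external DTD, which is where the real work lies.

```python
from xml.parsers import expat

# A self-contained document: the entity and the attribute default are both
# declared in the internal subset, so no SYSTEM identifier is needed.
DOC = """<!DOCTYPE note [
<!ENTITY who "Squeak">
<!ATTLIST note lang CDATA "en">
]>
<note>Hello, &who;!</note>"""

def parse_events(text):
    """Return ({element name: attributes}, concatenated character data)."""
    attrs_seen = {}
    chunks = []

    def start(name, attrs):
        # attrs includes values supplied by ATTLIST defaults, not only the
        # attributes written in the start tag.
        attrs_seen[name] = dict(attrs)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    # Character data may arrive in several pieces; collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(text, True)
    return attrs_seen, "".join(chunks)

attrs_seen, text = parse_events(DOC)
print(attrs_seen)
print(text)
```

The start tag carries no attributes and the text contains an entity reference, yet the application sees the defaulted attribute and the expanded text.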
If you can't do semi-validation, you are *severely* limited in the range of XML that you can usefully handle. One big issue is that people seem to be extremely fond of indenting XML; this creates entirely bogus white space nodes _unless_ you either (x) like one parser I know, delete _all_ white space nodes, which gets a fair bit of text wrong, or (y) at least semi-validate. To take a simple example, consider
<example>
<warning>This doesn't seem to be indented but it is.</warning>
<explanation>There are three white space nodes.</explanation>
</example>
Here is the ESIS for that example:
   (example
=> -\n
   (warning
   -This doesn't seem to be indented but it is.
   )warning
=> -\n
   (explanation
   -There are three white space nodes.
   )explanation
=> -\n
   )example
I have flagged the white space nodes. In SGML, those white space nodes would simply not exist, DTD or no DTD. In XML, they do. And people think XML is simpler! With a DTD,

<!ELEMENT example (warning,explanation*)+>
<!ELEMENT warning (#PCDATA)>
<!ELEMENT explanation (#PCDATA)>

a validating or semi-validating XML parser would know to report those newlines as element-content white-space, which can safely be ignored and never built into an internal representation in the first place. You _can't_ get this right without looking at the DTD.
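The effect is easy to reproduce with a non-validating DOM builder. The sketch below uses Python's xml.dom.minidom on the example document, counts the white space nodes, and then removes them using a hand-written set, ELEMENT_ONLY, that stands in for what the DTD would have told a semi-validating parser. That set and the helper names are assumptions for illustration; a real parser would derive the information from the <!ELEMENT ...> declarations.

```python
from xml.dom import minidom

DOC = (
    "<example>\n"
    "<warning>This doesn't seem to be indented but it is.</warning>\n"
    "<explanation>There are three white space nodes.</explanation>\n"
    "</example>"
)

# Hypothetical stand-in for the DTD: elements whose content model contains
# only child elements (no #PCDATA).
ELEMENT_ONLY = {"example"}

def whitespace_children(element):
    """Text-node children consisting solely of white space."""
    return [n for n in element.childNodes
            if n.nodeType == n.TEXT_NODE and n.data.strip() == ""]

def drop_element_content_whitespace(element):
    """Remove white space text only where the content model forbids text."""
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            drop_element_content_whitespace(child)
        elif (child.nodeType == child.TEXT_NODE
              and element.tagName in ELEMENT_ONLY
              and child.data.strip() == ""):
            element.removeChild(child)

root = minidom.parseString(DOC).documentElement
before = len(whitespace_children(root))
drop_element_content_whitespace(root)
after = len(whitespace_children(root))
print(before, after)
```

Because the filter consults the content model rather than deleting every blank text node, white space inside mixed-content elements such as warning is left alone.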
We can cross those three grades with some variations:

(1) Character set. Handle native character set only? Handle ISO Latin 1 only? Handle UTF-8 only? Handle the full range?
My XML parser in C can be compiled to use Latin-1, UTF-8, UCS-2, or UCS-4 internally. The actual parsing code doesn't know and doesn't care. (Yes this means I will accept characters in element and attribute names that I shouldn't. But all that will happen is that illegal input will be accepted. No legal input will be rejected, and no legal input will be mis-parsed.) The decoding file dwarfs everything else.
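The separation between decoding and parsing can be sketched as follows. This is not the C code described above, only an illustration of the layering: a decoding step turns bytes into characters, and whatever comes after it never sees an encoding. The function name decode_document and the fallback parameter are assumptions, and the sketch looks only at byte-order marks; it does not read the encoding declaration in the XML declaration, which a real decoder must also do.

```python
import codecs

# Longer signatures first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_document(raw, fallback="utf-8"):
    """Turn raw bytes into characters; the parser proper sees only these."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(name)
    # No byte-order mark: use the caller's choice (UTF-8 is the XML default).
    return raw.decode(fallback)

text = "<a>caf\u00e9</a>"
for encoding in ("utf-8-sig", "utf-16", "utf-32"):
    # Each encoding here writes a byte-order mark, so detection succeeds.
    assert decode_document(text.encode(encoding)) == text
assert decode_document(text.encode("latin-1"), fallback="latin-1") == text
```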
(2) Output method.
    Generate CXML text? Generate ESIS text? Generate Lisp, Erlang, Prolog text? Have SAX-*like* interface?

    Have exact SAX interface? Have DOM-*like* interface? Have exact DOM interface (not actually possible in C)? Have Lisp, Erlang, Prolog, Smalltalk interface? Have DVM interface? Have JDOM interface?
My parser provides the options above the blank line, all based on the event-oriented interface (an idea older than SGML, but nowadays credited to SAX). Again, this code dominates the parsing code in size. A SAX-*like* (but not SAX) interface can be small and very easy to use.
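As an illustration of how little an event-oriented interface needs, here is a sketch that produces ESIS-style lines from Python's xml.sax events. It is not the C parser described above; the class name EsisWriter and the function to_esis are made up, and only the "(", ")", "-" and "A" line types are generated, with just backslash and newline escaped.

```python
import xml.sax

def _escape(data):
    """Escape backslashes and newlines the way the ESIS listing shows them."""
    return data.replace("\\", "\\\\").replace("\n", "\\n")

class EsisWriter(xml.sax.ContentHandler):
    """Collects one ESIS-style line per event."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._text = []

    def _flush(self):
        # Character data can arrive in pieces; emit it as a single '-' line.
        if self._text:
            self.lines.append("-" + _escape("".join(self._text)))
            self._text = []

    def startElement(self, name, attrs):
        self._flush()
        for key in attrs.getNames():
            self.lines.append("A%s CDATA %s" % (key, _escape(attrs.getValue(key))))
        self.lines.append("(" + name)

    def endElement(self, name):
        self._flush()
        self.lines.append(")" + name)

    def characters(self, content):
        self._text.append(content)

def to_esis(xml_bytes):
    handler = EsisWriter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lines

DOC = (b"<example>\n"
       b"<warning>This doesn't seem to be indented but it is.</warning>\n"
       b"<explanation>There are three white space nodes.</explanation>\n"
       b"</example>")
print("\n".join(to_esis(DOC)))
```

The application supplies three callbacks and gets a flat stream of events; everything else (trees, Lisp or Prolog terms, and so on) can be layered on top of that stream.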
(3) Support for compression and encryption? (I leave this lot to your imagination. I do none of it.)
(4) Support for SYSTEM identifiers?
    No support (document must be self-contained)?
    SYSTEM identifiers are just local file names?
    SYSTEM identifiers are (relative) file://localhost filenames, in URL syntax and converted to form required by host OS?
    SYSTEM identifiers are URIs including support for file:, ftp:, http:?
    SYSTEM identifiers are URIs including https: and SSH ftp?
The main reason my parser doesn't semi-validate yet is that I've never written an FTP or HTTP client before; now that I've discovered the CURL library, my main problem may be solved. Needless to say, the CURL library is _enormous_ compared with everything else put together.
The two operations you need are
(a) Here is a base URI and another URI which may be relative or absolute; return the second resolved relative to the first.
(b) Here is a URI; return me an input stream reading its content, or fail.
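In a language whose library already speaks URIs, those two operations are nearly one-liners. A sketch in Python, using the standard urllib; the function names resolve_uri and open_uri and the example.org addresses are made up for illustration:

```python
from urllib.parse import urljoin
from urllib.request import urlopen

def resolve_uri(base, ref):
    """(a) Resolve ref, which may be relative or absolute, against base."""
    return urljoin(base, ref)

def open_uri(uri):
    """(b) Return a readable byte stream for uri, or raise on failure.

    urlopen raises urllib.error.URLError (a subclass of OSError) when the
    resource cannot be fetched, and ValueError for a malformed URI.
    """
    return urlopen(uri)

# A relative SYSTEM identifier found inside a DTD is resolved against the
# URI of the entity that contains it.
print(resolve_uri("http://example.org/dtd/docbook.dtd", "ent/iso-lat1.ent"))
# An absolute identifier is returned unchanged.
print(resolve_uri("http://example.org/dtd/docbook.dtd", "file:///usr/share/x.dtd"))
```

The parser itself only ever calls these two functions, so all the networking stays outside the parsing code.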
(5) Support for PUBLIC identifiers?
    No support?
    Support for a handful of known PUBLIC identifiers such as XHTML and especially the XHTML entity sets?
    Support for PUBLIC identifiers using homebrew catalogues?
    Support for PUBLIC identifiers using OASIS catalogues?
The last is very desirable if you want to handle DocBook or many of the things that XML is useful for, but it is about as difficult as well-formedness parsing of XML.
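For comparison, a homebrew catalogue can be tiny. The sketch below reads lines of the form PUBLIC "public id" "system id", which is the shape of a PUBLIC entry in an OASIS catalogue, but it handles nothing else from that format (no comments, no other entry types, no delegation). The function names and the local file path are made up for illustration.

```python
import shlex

def normalize_public_id(public_id):
    """Collapse runs of white space and trim, as XML requires before matching."""
    return " ".join(public_id.split())

def load_catalogue(text):
    """Read lines of the form: PUBLIC "public id" "system id"."""
    table = {}
    for line in text.splitlines():
        fields = shlex.split(line)  # handles the quoted strings
        if len(fields) == 3 and fields[0] == "PUBLIC":
            table[normalize_public_id(fields[1])] = fields[2]
    return table

def resolve_public(table, public_id, system_id):
    """Prefer the catalogue entry; fall back to the document's SYSTEM id."""
    return table.get(normalize_public_id(public_id), system_id)

CATALOGUE = 'PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "dtds/xhtml1-strict.dtd"\n'
table = load_catalogue(CATALOGUE)
print(resolve_public(table,
                     "-//W3C//DTD  XHTML 1.0 Strict//EN",
                     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"))
```

With such a table in front of the SYSTEM-identifier machinery, common DTDs and entity sets can be served from local files without any network access.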
The great thing about having an XML parser in Squeak is that the really hard stuff, like URI processing, compression, and encryption, can be kept out of your XML parsing code.
If Squeak is to have an "official" XML parser, let it be one