"Duane Maxwell" dmaxwell@san.rr.com wrote: The exobox parser is a complete well-formedness, non-validating parser minus Unicode support - every obscure little syntax weirdness is handled, even if the result is eventually dropped on the floor.
Having written a couple of well-formedness parsers (including one with Unicode support), let me plead with the Squeak community *not* to accept a non-validating XML parser.
If you DON'T validate, it is DEAD EASY. There really isn't a lot to do. My SAX-style non-validating parser in C literally spends more time in the read() system call than everywhere else put together. When an XML parser runs about as fast as wc(1), you know that XML parsing can't be hard. (And yes, every single quirk I know of is handled.)
There are actually three useful "grades" of parser, only two of which are covered by the XML 1.0 specification:
- well-formedness checking (basically, do the brackets balance?)
- semi-validation: resolve entity references, fill in #DEFAULT and #FIXED attribute values, know when an element allows #PCDATA content and when it doesn't, so that element-content white-space can be reliably discriminated from text in mixed content.
- full validation: check that the document conforms to the DTD.
If you can do semi-validation, you can handle XHTML and DocBook and other text formats.
If you can't do semi-validation, you are *severely* limited in the range of XML that you can usefully handle. One big issue is that people seem to be extremely fond of indenting XML; this creates entirely bogus white space nodes _unless_ you either (x) like one parser I know, delete _all_ white space nodes, which gets a fair bit of text wrong, or (y) at least semi-validate. To take a simple example, consider
<example>
<warning>This doesn't seem to be indented but it is.</warning>
<explanation>There are three white space nodes.</explanation>
</example>
Here is the ESIS for that example:
(example
=> -\n
(warning
-This doesn't seem to be indented but it is.
)warning
=> -\n
(explanation
-There are three white space nodes.
)explanation
=> -\n
)example
I have flagged the white space nodes. In SGML, those white space nodes would simply not exist, DTD or no DTD. In XML, they do. And people think XML is simpler! With a DTD,
<!ELEMENT example (warning,explanation*)+>
<!ELEMENT warning (#PCDATA)>
<!ELEMENT explanation (#PCDATA)>
a validating or semi-validating XML parser would know to report those newlines as element-content white-space, which can safely be ignored and never built into an internal representation in the first place. You _can't_ get this right without looking at the DTD.
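To make the point concrete, here is a minimal sketch in Python (standard library only; it is not the C parser described in this message). It parses the example with xml.dom.minidom, which reports every white space node, and counts them. The second function is a hypothetical stand-in for semi-validation: it assumes the caller already knows which elements have element-only content. The ELEMENT_ONLY set is typed in by hand here; a real semi-validating parser would derive it from the DTD.

```python
from xml.dom import minidom

DOC = (
    "<example>\n"
    "<warning>This doesn't seem to be indented but it is.</warning>\n"
    "<explanation>There are three white space nodes.</explanation>\n"
    "</example>"
)

# Assumption: in a real semi-validating parser this set would come from
# the DTD's content models; here it is written out by hand.
ELEMENT_ONLY = {"example"}


def whitespace_children(element):
    """Return the white-space-only text nodes directly under element."""
    return [
        node for node in element.childNodes
        if node.nodeType == node.TEXT_NODE and node.data.strip() == ""
    ]


def drop_element_content_whitespace(element, element_only=ELEMENT_ONLY):
    """Remove white-space-only text nodes, but only inside elements whose
    content model does not allow #PCDATA; other elements are left alone."""
    if element.tagName in element_only:
        for node in whitespace_children(element):
            element.removeChild(node)
    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            drop_element_content_whitespace(child, element_only)


root = minidom.parseString(DOC).documentElement
before = len(whitespace_children(root))   # nodes a DTD-blind parser reports
drop_element_content_whitespace(root)
after = len(whitespace_children(root))    # nodes left once the model is known
```

The text inside warning and explanation is never touched, because those names are not in ELEMENT_ONLY. Deleting all white space nodes blindly, option (x) above, has no such safeguard.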
We can cross those three grades with some variations: (1) Character set. Handle native character set only? Handle ISO Latin 1 only? Handle UTF-8 only? Handle the full range?
My XML parser in C can be compiled to use Latin-1, UTF-8, UCS-2, or UCS-4 internally. The actual parsing code doesn't know and doesn't care. (Yes this means I will accept characters in element and attribute names that I shouldn't. But all that will happen is that illegal input will be accepted. No legal input will be rejected, and no legal input will be mis-parsed.) The decoding file dwarfs everything else.
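As a rough illustration of keeping decoding separate from parsing, the sketch below (Python, standard library only) picks a codec at run time from a leading byte-order mark and hands the parser plain code points. It is not how the C parser above works, and the function names are hypothetical. It also deliberately ignores the encoding declaration in the XML prolog, which a real decoder would have to honour.

```python
import codecs


def sniff_encoding(data):
    """Guess a codec name from a leading byte-order mark, defaulting to UTF-8."""
    # UTF-32 must be tested first: its little-endian BOM starts with the
    # same two bytes as the UTF-16 little-endian BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return "utf-8"


def decode_document(data):
    """Turn raw bytes into a str of code points for the parser proper."""
    return data.decode(sniff_encoding(data))
```

The parsing code downstream only ever sees decoded characters, so it neither knows nor cares which encoding the bytes arrived in.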
(2) Output method. Generate CXML text? Generate ESIS text? Generate Lisp, Erlang, Prolog text? Have SAX-*like* interface?
Have exact SAX interface? Have DOM-*like* interface? Have exact DOM interface (not actually possible in C)? Have Lisp, Erlang, Prolog, Smalltalk interface? Have DVM interface? Have JDOM interface?
My parser provides the options above the blank line, all based on the event-oriented interface (an idea older than SGML, but nowadays credited to SAX). Again, this code dominates the parsing code in size. A SAX-*like* (but not SAX) interface can be small and very easy to use.
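For readers who have not seen the event-oriented style, here is a small sketch using Python's standard xml.sax module (again, not the C parser described here). A handler receives start, end, and character-data events and turns them into ESIS-like lines. Attributes are ignored to keep it short, and the class and function names are made up for this example.

```python
import xml.sax


class EsisHandler(xml.sax.ContentHandler):
    """Collect parse events as ESIS-like lines: (name, )name, -text."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._text = []  # character data may arrive in several chunks

    def _flush_text(self):
        if self._text:
            text = "".join(self._text).replace("\n", "\\n")
            self.lines.append("-" + text)
            self._text = []

    def startElement(self, name, attrs):
        self._flush_text()
        self.lines.append("(" + name)

    def endElement(self, name):
        self._flush_text()
        self.lines.append(")" + name)

    def characters(self, content):
        self._text.append(content)


def to_esis(xml_text):
    """Parse a string and return the list of ESIS-like lines."""
    handler = EsisHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines
```

Everything else in the list above (text output, tree building, bindings for other languages) can be layered on top of the same three callbacks.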
(3) Support for compression and encryption? (I leave this lot to your imagination. I do none of it.)
(4) Support for SYSTEM identifiers? No support (document must be self-contained)? SYSTEM identifiers are just local file names? SYSTEM identifiers are (relative) file://localhost filenames, in URL syntax and converted to form required by host OS? SYSTEM identifiers are URIs including support for file:, ftp:, http:? SYSTEM identifiers are URIs including https: and SSH ftp?
The main reason my parser doesn't semi-validate yet is that I've never written an FTP or HTTP client before; now that I've discovered the CURL library, my main problem may be solved. Needless to say, the CURL library is _enormous_ compared with everything else put together.
The two operations you need are (a) Here is a base URI and another URI which may be relative or absolute; return second resolved relative to the first. (b) Here is a URI; return me an input stream reading its content or fail.
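Those two operations can be sketched with Python's standard urllib (an illustration only; the function names are hypothetical and this is not the CURL-based approach mentioned above). Which schemes actually open depends on the handlers the library ships with.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def resolve(base_uri, reference):
    """(a) Resolve a relative or absolute URI against a base URI."""
    return urljoin(base_uri, reference)


def open_uri(uri):
    """(b) Return a readable binary stream for the URI, or None on failure."""
    try:
        return urlopen(uri)
    except (OSError, ValueError):
        # URLError is a subclass of OSError; ValueError covers strings
        # that do not parse as a URI with a scheme.
        return None
```

A parser built on these two calls never needs to know whether a SYSTEM identifier names a local file or a remote resource.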
(5) Support for PUBLIC identifiers? No support? Support for a handful of known PUBLIC identifiers such as XHTML and especially the XHTML entity sets? Support for PUBLIC identifiers using homebrew catalogues? Support for PUBLIC identifiers using OASIS catalogues?
The last is very desirable if you want to handle DocBook or many of the things that XML is useful for, but it is about as difficult as well-formedness parsing of XML.
The great thing about having an XML parser in Squeak is that the really hard stuff, like URI processing, compression, and encryption, can be kept out of your XML parsing code.
If Squeak is to have an "official" XML parser, let it be one
Richard A. O'Keefe wrote:
"Duane Maxwell" dmaxwell@san.rr.com wrote: The exobox parser is a complete well-formedness, non-validating parser
minus Unicode support - every obscure little syntax weirdness is handled, even if the result is eventually dropped on the floor.
Having written a couple of well-formedness parsers (including one with Unicode support), let me plead with the Squeak community *not* to accept a non-validating XML parser.
If you DON'T validate, it is DEAD EASY. There really isn't a lot to do. My SAX-style non-validating parser in C literally spends more time in the read() system call than everywhere else put together. When an XML parser runs about as fast as wc(1), you know that XML parsing can't be hard. (And yes, every single quirk I know of is handled.)
I agree that a well-formedness parser is relatively easy, which is why there are so many of them - when I wrote it, however, there weren't any for Squeak. On the other hand, I think you can count on one hand all of the fully validating parsers in *any* language. It's very tough to implement everything correctly, and generally unnecessary. If we were to wait until such a parser existed under an appropriate Squeak-compatible license in Smalltalk, we'd never have anything. By putting something in now that has the potential of being extended, we at least open the door to handling XML data even if we let stuff through that might not otherwise survive validation.
If you can do semi-validation, you can handle XHTML and DocBook and other text formats.
That seems like a reasonable extension to the currently available parsers.
If you can't do semi-validation, you are *severely* limited in the range of XML that you can usefully handle. One big issue is that people seem to be extremely fond of indenting XML; this creates entirely bogus white space nodes _unless_ you either (x) like one parser I know, delete _all_ white space nodes, which gets a fair bit of text wrong, or (y) at least semi-validate.
Agreed. I don't think the exobox parser is correct in this case, and it should probably be fixed.
We can cross those three grades with some variations: (1) Character set. Handle native character set only? Handle ISO Latin 1 only? Handle UTF-8 only? Handle the full range?
My XML parser in C can be compiled to use Latin-1, UTF-8, UCS-2, or UCS-4 internally. The actual parsing code doesn't know and doesn't care. (Yes this means I will accept characters in element and attribute names that I shouldn't. But all that will happen is that illegal input will be accepted. No legal input will be rejected, and no legal input will be mis-parsed.) The decoding file dwarfs everything else.
I think the limitation with the current Squeak parsers is the lack of Unicode support in Squeak, and nothing else. It was a problem I was not prepared to solve, and it wasn't a deal breaker at the time. Fortunately, in the Real World, the amount of XML in some actual encoding other than 7-bit ASCII or ISO8859 is rather low.
Am I correct that your parser requires recompilation to support different character sets? Doesn't this require advance knowledge of the encoding of the XML and therefore limit the usefulness of your solution? Just curious.
(2) Output method. Generate CXML text? Generate ESIS text? Generate Lisp, Erlang, Prolog text? Have SAX-*like* interface?
Have exact SAX interface? Have DOM-*like* interface? Have exact DOM interface (not actually possible in C)? Have Lisp, Erlang, Prolog, Smalltalk interface? Have DVM interface? Have JDOM interface?
My parser provides the options above the blank line, all based on the event-oriented interface (an idea older than SGML, but nowadays credited to SAX). Again, this code dominates the parsing code in size. A SAX-*like* (but not SAX) interface can be small and very easy to use.
I believe that none of those are part of the XML specification - those are potential representations of the parsed text.
(3) Support for compression and encryption? (I leave this lot to your imagination. I do none of it.)
AFAIK, compression/encryption is not part of the XML specification per se. Many people do both of these for obvious reasons, but there are no standards with regard to XML. The closest I've seen is the binary XML format used by the WAP guys to tokenize XML for quick transmission.
(4) Support for SYSTEM identifiers? No support (document must be self-contained)? SYSTEM identifiers are just local file names? SYSTEM identifiers are (relative) file://localhost filenames, in URL syntax and converted to form required by host OS? SYSTEM identifiers are URIs including support for file:, ftp:, http:? SYSTEM identifiers are URIs including https: and SSH ftp?
The main reason my parser doesn't semi-validate yet is that I've never written an FTP or HTTP client before; now that I've discovered the CURL library, my main problem may be solved. Needless to say, the CURL library is _enormous_ compared with everything else put together.
Yes, I agree that supporting all of that is problematic. You forgot supporting proprietary Microsoft protocols and protocols yet to be invented. That's one of the problems with this section of the XML specification.
The two operations you need are (a) Here is a base URI and another URI which may be relative or absolute; return second resolved relative to the first. (b) Here is a URI; return me an input stream reading its content or fail.
(5) Support for PUBLIC identifiers? No support? Support for a handful of known PUBLIC identifiers such as XHTML and especially the XHTML entity sets? Support for PUBLIC identifiers using homebrew catalogues? Support for PUBLIC identifiers using OASIS catalogues?
The last is very desirable if you want to handle DocBook or many of the things that XML is useful for, but it is about as difficult as well-formedness parsing of XML.
The great thing about having an XML parser in Squeak is that the really hard stuff, like URI processing, compression, and encryption, can be kept out of your XML parsing code.
Agreed. The code I wrote was meant to be extended to do all these wonderful things. However, the form it is currently in adequately scratched the itch it was designed to scratch, and development within exobox halted at that point. It handled Jabber, RSS, and various other content feeds, and layout specifications, and myriad other little things.
If Squeak is to have an "official" XML parser, let it be one
Please find/write one, and I'd be the first to vote for its inclusion. I just think holding one's breath waiting for such an animal to appear fully formed and functional is a little silly. In the meantime, there are some purposes to which any of the proposed parsers could be put pending their evolution into the uber-parser. I suspect what would happen is that as people find needs for the more advanced aspects of XML, they'll make the necessary changes. That's the way everything else is in Squeak, so I see no reason why this would be different.
The arguments you're making could just as easily be applied to Scamper (lack of perfect table support, no Javascript, no https, no frames, etc), Celeste (no IMAP, MAPI, WebDAV, etc), or the little Telnet client (no VT100 emulation), etc. Squeak is rife with half-baked but nonetheless very useful things.
Cheers -
-- Duane
On Wed, 28 Nov 2001, Duane Maxwell wrote:
[snip]
I agree that a well-formedness parser is relatively easy, which is why there are so many of them - when I wrote it, however, there weren't any for Squeak. On the other hand, I think you can count on one hand all of the fully validating parsers in *any* language. It's very tough to implement everything correctly, and generally unnecessary.
Validation is a shifting goal of course. There's DTD validation, XML Schema (run away!), RELAX NG, etc.
If we were to wait until such a parser existed under an appropriate Squeak-compatible license in Smalltalk, we'd never have anything. By putting something in now that has the potential of being extended, we at least open the door to handling XML data even if we let stuff through that might not otherwise survive validation.
Er...but it seems these considerations support the arguments I've been making. At least the VWXML stuff *attempts* DTD validation, the code owners regard failures in this realm to be bugs, and are committed to extending it (unto XML Schema validation!!!).
If we're going to go for the brass ring, I want to be standing on top of a tall horse. Or something. :)
I've not picked apart the problem Andreas had with the VWXML parser. It is true that it had neither a terrific interface nor documentation.
Plus, more than the Squeak community is working on it. Not just Cincom, but other folks. And not just the VisualWorks community.
The VWXML parser is partial. So if being "something incomplete but useful" is a measure of worthiness, it's worthy :)
The licence issue has to be investigated, yes. So too does the new code base (i.e., VW 5i.4). I'm willing to spearhead an effort to port all that which we can pry loose from Cincom (which would result, at least for now, in a (somewhat) validating parser, a partial XSLT engine, some XPath stuff, maybe SOAP support; these are the extant things, the things that already exist).
But I'm willing to believe that it's premature to standardize on the parser/node set. It's not premature to get Unicode support, though :)
OTOH, I've mentioned that a very rational, super minimal core can rest under quite a variety of superstructures, including those for validation, etc. If such a core would sensibly replace the guts of the VWXML parser (and a number of others) I'm for it. Or something :)
Cheers, Bijan Parsia.
squeak-dev@lists.squeakfoundation.org