"Duane Maxwell" dmaxwell@san.rr.com wrote: Am I correct that your parser requires recompilation to support different character sets?
No.  There are four internal encodings:

    LAT1    ISO Latin 1 (or ASCII)
    UTF8
    UCS2
    UCS4

The *internal* encoding must be selected at compile time.

There are several external encodings:

    ISO Latin 1
    other 8-bit ASCII extension (table-driven)
    UTF8
    UCS2-big-endian
    UCS2-little-endian
    UCS4-big-endian
    UCS4-little-endian

The *external* encoding is selected at run time.

The hack described in the XML 1.0 specification lets you figure out UCS2-BE, UCS2-LE, UCS4-BE, UCS4-LE, or some 8-bit code, because the first character of an XML document must be space, tab, carriage return, line feed, or '<' (or possibly a byte order mark; I've tried to determine what Unicode 3.0 + XML 1.0 together say about that, but I'm not at all sure).

For figuring out which 8-bit code you have, XML 1.0 says you _have_ to handle UTF8, and _don't_ have to handle anything else, so the hack could suffice. But that's why we have XML declarations. So you have to start on the assumption that it's UTF-8 (no, starting on the assumption that it's Latin-1 is *not* allowed by the XML 1.0 specification) and then switch when you find the XML declaration.

With 4*7 = 28 combinations of internal and external encoding, you can see that there's a lot of code; fortunately some cases are easy, so it's not as bad as it sounds. In order to get good performance, the recoding is done a block at a time. Unfortunately, that means that I may have to switch from UTF8 to some other 8-bit code partway through a block, and that was *tough* to get right.
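To make the auto-detection hack concrete, here is a minimal sketch in Python (not the C code of the parser under discussion). It assumes, as described above, that the first character is white space or '<', possibly preceded by a byte order mark. The function name `sniff_encoding_family` and the returned labels are hypothetical, chosen only for this illustration.

```python
# First characters the text above says an XML document may start with.
FIRST = frozenset(b" \t\r\n<")

def sniff_encoding_family(head: bytes) -> str:
    """Guess character width and byte order from the first bytes of a document."""
    # Byte order marks: the 4-byte marks must be tested before the 2-byte ones,
    # because FF FE 00 00 (UCS4-LE) begins with FF FE (UCS2-LE).
    if head.startswith(b"\x00\x00\xfe\xff"):
        return "UCS4-BE"
    if head.startswith(b"\xff\xfe\x00\x00"):
        return "UCS4-LE"
    if head.startswith(b"\xfe\xff"):
        return "UCS2-BE"
    if head.startswith(b"\xff\xfe"):
        return "UCS2-LE"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF8"
    # No mark: the first character is ASCII, so the position of the zero
    # bytes around it reveals the width and the byte order.
    if len(head) >= 4:
        b0, b1, b2, b3 = head[:4]
        if (b0, b1, b2) == (0, 0, 0) and b3 in FIRST:
            return "UCS4-BE"
        if (b1, b2, b3) == (0, 0, 0) and b0 in FIRST:
            return "UCS4-LE"
    if len(head) >= 2:
        if head[0] == 0 and head[1] in FIRST:
            return "UCS2-BE"
        if head[1] == 0 and head[0] in FIRST:
            return "UCS2-LE"
    # Some 8-bit code: start as UTF-8, switch when the XML declaration says so.
    return "8BIT"
```

A result of "8BIT" only narrows things down to the 8-bit family; which 8-bit code it is still has to come from the XML declaration.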
There are some nasty little quirks in XML. For example, what if you auto-detect that the input is UCS2, but then the XML declaration says the encoding is UTF8? Or you auto-detect that it's an 8-bit code, but then the XML declaration says it's UCS2? On the assumption that such a clash could come about if a document was recoded by another program that didn't know or care about XML declarations, I decided to just write a warning message and keep going.
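The "warn and keep going" policy might look something like the following Python sketch. Two things here are assumptions, not statements about the actual parser: that on a clash the auto-detected width and byte order are kept (the bytes really are laid out that way), and that unknown encoding names are treated as 8-bit ASCII extensions. The table `DECLARED_FAMILY` and the function `reconcile` are hypothetical names.

```python
# Hypothetical table: which width family a declared encoding name implies.
DECLARED_FAMILY = {
    "UTF-8": "8BIT",
    "US-ASCII": "8BIT",
    "ISO-8859-1": "8BIT",
    "UTF-16": "UCS2",
    "ISO-10646-UCS-2": "UCS2",
    "ISO-10646-UCS-4": "UCS4",
}

def reconcile(detected: str, declared: str):
    """Return (encoding to use, warning text or None).

    `detected` is a label such as "8BIT", "UCS2-BE" or "UCS4-LE";
    `declared` is the name found in the XML declaration.
    """
    detected_family = detected.split("-")[0]          # "UCS2-BE" -> "UCS2"
    # Assumption: names not in the table are 8-bit, table-driven codes.
    declared_family = DECLARED_FAMILY.get(declared.upper(), "8BIT")
    if declared_family == detected_family:
        # No clash. Within the 8-bit family the declaration picks the code;
        # for wide codes the detected byte order is what matters.
        return (declared if detected_family == "8BIT" else detected), None
    warning = ("encoding clash: input looks like %s but the declaration says %s; "
               "continuing with the detected layout" % (detected, declared))
    # Assumption: keep going with what the bytes look like.  For 8-bit
    # input that means staying with the UTF-8 default.
    return ("UTF-8" if detected_family == "8BIT" else detected), warning
```

For example, `reconcile("UCS2-LE", "UTF-8")` keeps "UCS2-LE" and returns a warning, which matches the case of a document recoded by a program that never touched the declaration.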
non-white-space characters
> (2) Output method.
>     Generate CXML text?
>     Generate ESIS text?
>     Generate Lisp, Erlang, Prolog text?
>     Have SAX-*like* interface?
>
>     Have exact SAX interface?
>     Have DOM-*like* interface?
>     Have exact DOM interface (not actually possible in C)?
>     Have Lisp, Erlang, Prolog, Smalltalk interface?
>     Have DVM interface?
>     Have JDOM interface?
>
> My parser provides the options above the blank line, all based on
> the event-oriented interface (an idea older than SGML, but nowadays
> credited to SAX). Again, this code dominates the parsing code in
> size. A SAX-*like* (but not SAX) interface can be small and very
> easy to use.

I believe that none of those are part of the XML specification - those are potential representations of the parsed text.

Indeed. But *some* representation must be provided. And the DOM is a W3C recommendation in the same way that XML itself is.
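For readers who haven't met the event-oriented style mentioned in the quoted text, here is a small illustration. It uses Python's standard `xml.parsers.expat` binding as a stand-in, not the parser discussed in this thread; the function name `parse_events` and the tuple layout of the events are made up for the example.

```python
import xml.parsers.expat

def parse_events(document: bytes) -> list:
    """Parse a document and return the stream of events as tuples."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to deliver each run of character data in one piece.
    parser.buffer_text = True
    # The whole "interface" is three callbacks, one per kind of event.
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda text: events.append(("text", text))
    parser.Parse(document, True)   # True: this is the final chunk of input
    return events
```

Everything else in the list above (ESIS text, Lisp or Prolog terms, a DOM-like tree) can be built on top of a stream of events like this one.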
See, there are lots of other variations.
- Do you support the XML Base recommendation or not? [Not in XML 1.0, but W3C think it's part of the Core.]
- Do you support Namespaces or not? [Not in XML 1.0, but W3C think it's part of the Core.]
- Do you support XML Include or not? [Not in XML 1.0, but W3C think it's part of the Core.]
I have been personally assured, via E-mail, by some W3C people, that they don't regard any parser that doesn't support all of these things as a "real" XML parser, and if most of the parsers I had ready access to didn't support them, that was just my tough luck.
> (3) Support for compression and encryption?
>     (I leave this lot to your imagination. I do none of it.)

AFAIK, compression/encryption is not part of the XML specification per se.
But they ARE part of data transmission. If we're talking about sending business or private information over the Internet, then some form of crypto support had _better_ be there, even if it's only via SSL or SSH.
Many people do both of these for obvious reasons, but there are no standards with regard to XML. The closest I've seen is the binary XML format used by the WAP guys to tokenize XML for quick transmission. I've got a WBXML decompressor/parser too. Sort of fun. What I don't have is a WBXML compressor. Modulo weak support for entities and notations, it makes sense for a lot of SGML-style data, not just WML.
> (4) Support for SYSTEM identifiers?
Yes, I agree that supporting all of that is problematic. You forgot supporting proprietary Microsoft protocols and protocols yet to be invented. That's one of the problems with this section of the XML specification. Well, a tool that doesn't and can't support any proprietary Microsoft protocols gets a nice fat tick in the "security features" box on my checklist. ftp:, http:, and https: are the ones we're mainly seeing in (X)HTML, and it would be nice if Scamper could use Squeak's XML parser.
Agreed. The code I wrote was meant to be extended to do all these wonderful things. However, the form it is currently in adequately scratched the itch it was designed to scratch, and development within exobox halted at that point. It handled Jabber, RSS, and various other content feeds, and layout specifications, and myriad other little things. The problem, of course, is money.
> If Squeak is to have an "official" XML parser, let it be one

Please find/write one, and I'd be the first to vote for its inclusion.
Well, I have a perfectly workable solution for UNIX. It should be OK for Windows and MacOS X as well. Grab an existing XML parser in C, such as the one in SWI Prolog (only handles Latin 1, only handles files, not http: or ftp: URIs, does pretty much everything else) or ltxml-1.1 (free from the University of Edinburgh, subject to terms of use; see http://www.ltg.ed.ac.uk/software/xml/ for details; this one does pretty much everything in XML 1.0), or even take libxml and wrap some kind of main() around it. To avoid any licence issues, run the parser in another process. Pipe the document to it, and collect the events coming back.
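The shape of that arrangement can be sketched as follows. To keep the example self-contained, the "other process" here is a second Python interpreter running a tiny expat-based script, standing in for a C parser with a main() wrapped around it; the ESIS-like line format ("(" for start tags, ")" for end tags, "-" for text) and the names `CHILD` and `parse_in_subprocess` are assumptions for illustration. In Squeak the parent side would go through something like OSProcess instead.

```python
import subprocess
import sys

# Stand-in for the external parser: reads a document on stdin and writes
# one event per line on stdout.
CHILD = r'''
import sys, xml.parsers.expat
sys.stdout.reconfigure(encoding="utf-8")
out = sys.stdout
p = xml.parsers.expat.ParserCreate()
p.buffer_text = True
p.StartElementHandler = lambda name, attrs: out.write("(" + name + "\n")
p.EndElementHandler = lambda name: out.write(")" + name + "\n")
# Escape backslashes and newlines so that each event stays on one line.
p.CharacterDataHandler = lambda t: out.write(
    "-" + t.replace("\\", "\\\\").replace("\n", "\\n") + "\n")
p.Parse(sys.stdin.buffer.read(), True)
'''

def parse_in_subprocess(document: bytes) -> list:
    """Pipe a document to the parser process and collect the event lines."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=document,
        stdout=subprocess.PIPE,
        check=True,          # raise if the parser process reports failure
    )
    return result.stdout.decode("utf-8").splitlines()
```

The parent never links against the parser, which is the point of the licence remark above; the cost is a process start and a round trip through two pipes per document.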
I just think holding one's breath waiting for such an animal to appear fully formed and functional is a little silly.
Not so. If we don't insist on rewriting everything, we can have it right now. Not as efficient as we'd like; not usable in a Squeak-NOS environment, but much more capable than anything we've got in Squeak itself.
In the meantime, there are some purposes to which any of the proposed parsers could be put pending their evolution into the uber-parser.
Indeed.
I suspect what would happen is that as people find needs for the more advanced aspects of XML, they'll make the necessary changes.
Um, that's not actually _easy_. I am slowly rewriting my XML parser to do semi-validation (slowly because I have a lot of other things on my plate) and it's rather harder than well-formedness parsing. (I don't even want to think about Schemas.) There isn't much you can do without doing quite a bit.
That's the way everything else is in Squeak, so I see no reason why this would be different.
The arguments you're making could just as easily be applied to Scamper (lack of perfect table support, no Javascript, no https, no frames, etc), Celeste (no IMAP, MAPI, WebDAV, etc), or the little Telnet client (no VT100 emulation), etc. Squeak is rife with half-baked but nonetheless very useful things. Somehow it always seems to be the bit _I_ need that doesn't quite work or isn't quite written yet. If we all had the same itches to scratch, we could all use the exobox parser and be happy. We don't.
If Netscape and IE can't handle tables perfectly (no, shoving content off the edge of the window is _not_ perfect handling, there really REALLY needs to be an option to tell the browser 'if you see widths in pixels, add 'em up and convert 'em to percentages because the drongo that wrote the page assumed everyone had 72 dpi and US paper and I have neither') I don't think Scamper needs to. The absence of Javascript doesn't stop me using Amaya and liking it very well; indeed that very absence gets another fat tick in the "security features" box. Https is a limit. As for Celeste, that's not all it's missing, which is why I can't use it on my UNIX box. I didn't know there was a Telnet client; I must try it out. Thanks for the info.
My students have often run into Squeak's rough edges. In all fairness to Squeak, I must say that there are a LOT of half-baked XML parsers out there, and it sometimes seems that the entire XML world is half-baked or worse.
Is there a compromise position?
How about a rough-and-ready not-quite-XML parser written in Squeak with nice data structures and interfaces. AND an OSProcess-based wrapper around libxml with the same data structures and interfaces?
squeak-dev@lists.squeakfoundation.org