XML Parser, interleaving text and elements

Tue Sep 9 16:37:28 UTC 2003

Lex Spoon wrote:
> Avi Bryant <avi at beta4.com> wrote:
> 
>>On Thu, 4 Sep 2003 sstnjpm02 at sneakemail.com wrote:
>>
>>
>>>Thanks. I see that my example works properly but I hope I am not trading
>>>one set of problems for another. So far I found one problem which prevents any
>>>of my html from rendering:
>>>
>>><br/>              prints as <br//>
>>><input href="y"/>  prints as <input href="y"/> /&gt;
>>>
>>>and various other problems with &gt;....   being added
>>
>>The problem seems to lie with Scamper's HTMLTokenizer class, which the
>>HTML-Parser package reuses.
>>
>>It looks like some hacking of #nextName and #nextTag would be in order.
>>If I get a chance I'll look at that later tonight.
>>
> 
> 
> Well, HTML doesn't have self-closing tags like this.  Are you thinking
> of hacking the tokenizer to return *two* tags when it sees a
> self-closing tag?  I suppose that would be a reasonable way to go, since
> the main goal of these classes is to render.

HTML has tags like <br> which are neither self-closing nor closed by 
anything else.  This is incredibly difficult to parse because you 
actually have to understand the behaviour of each individual tag (which 
is why the HTML parser I wrote basically has to encode the entire HTML 
spec in code).
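
To make the per-tag knowledge concrete, here is a minimal sketch in Python. 
It uses the standard library's html.parser as a stand-in (it is not the 
Squeak parser discussed here), and the VOID_ELEMENTS table, the 
OpenElementTracker class and the open_after helper are hypothetical names 
made up for illustration.  The table is deliberately partial.

```python
from html.parser import HTMLParser

# Assumed, partial table of HTML elements that never take a closing tag.
# A real parser needs far more per-tag knowledge than this.
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class OpenElementTracker(HTMLParser):
    """Keeps a stack of currently open elements, skipping void tags."""

    def __init__(self):
        super().__init__()
        self.open_elements = []

    def handle_starttag(self, tag, attrs):
        # Without the table, <br> would stay "open" forever.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        # Close the nearest matching element and anything nested inside it.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def open_after(markup):
    """Return the elements still open once the markup has been read."""
    tracker = OpenElementTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.open_elements
```

Under these assumptions, open_after("<div><p>one<br>two") leaves only div 
and p open; the <br> never enters the stack because the table says it has 
no closing tag.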

XHTML was devised as a solution to this problem.  It is backwards 
compatible with existing browsers but parses as valid XML.

The XHTML spec says, however, that you should put a space before the 
closing / for backwards compatibility with HTML.  I'm 
not sure why I didn't notice this when this message first showed up, but 
is there any chance it just works if you use:

<br />
<input href="y" />  (not that input tags have href attributes... not 
sure where this came from :) )
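
For comparison, a tolerant tokenizer can treat both spellings the same way 
and report a self-closing tag as a start tag followed by an end tag, which 
is the "two tags" idea from earlier in the thread.  The sketch below uses 
Python's html.parser purely as an illustration (it says nothing about how 
Scamper's HTMLTokenizer behaves); EventLog and tokenize are hypothetical 
names.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Records start and end tag events in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    # handle_startendtag is not overridden: the default implementation
    # calls handle_starttag and then handle_endtag, so a self-closing
    # tag shows up as two events.


def tokenize(markup):
    """Return the list of tag events for a piece of markup."""
    log = EventLog()
    log.feed(markup)
    log.close()
    return log.events
```

With this stand-in, tokenize('<br/>') and tokenize('<br />') should give 
the same two events, as should the two spellings of the input tag.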

> What's up with self-closing tags, anyway?  XML throws away all the
> niceties of SGML... and then adds this?  What a nuisance.

Well, it's just a shortcut for an empty tag.  You could remove it but 
it isn't hard to parse and it is cleaner to look at (and XML is 
/supposed/ to be human-readable :) ).
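
A quick way to see the equivalence, sketched with Python's 
xml.etree.ElementTree (the shape helper is a made-up name for 
illustration): the short and long forms of an empty element produce the 
same tree, including the text interleaved around it.

```python
import xml.etree.ElementTree as ET


def shape(elem):
    """Describe an element as (tag, text, tail, children) for comparison."""
    return (elem.tag, elem.text, elem.tail, [shape(child) for child in elem])


# The same paragraph written with the shortcut and with an explicit
# open/close pair for the empty element.
short_form = ET.fromstring("<p>one<br/>two</p>")
long_form = ET.fromstring("<p>one<br></br>two</p>")

same_tree = shape(short_form) == shape(long_form)
```

Here same_tree is expected to be True: in both cases "one" is the text of 
p and "two" is the tail of the empty br element.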

Julian