HtmlParser Re: [ENH]Html table (second version)

Karl Ramberg karl.ramberg at chello.se
Tue Aug 28 11:51:22 UTC 2001


"Randal L. Schwartz" wrote:
> 
> >>>>> "Richard" == Richard A O'Keefe <ok at atlas.otago.ac.nz> writes:
> 
> Richard>        John Hinsley's table example is very ill-formed.  Yes, there is
> Richard>        ill-formed HTML in the real world, but I suggest that you focus
> Richard>        first on getting Scamper fully functional with properly
> Richard>        formatted HTML containing balanced tags.
> 
> Richard> Why the insistence on balanced tags?  There's a lot of well
> Richard> written HTML out there (with the W3C's validation stamps on
> Richard> it yet) that doesn't use space-bloating balanced tags.
> 
> Richard> For each element type, note whether it allows text or not,
> Richard> and which other element types it allows.  Whenever you
> Richard> encounter text or a tag, if it is not allowed by the current
> Richard> element, close and pop elements that allow their end-tags to
> Richard> be omitted until you find an element that does allow the new item,
> Richard> or an element that doesn't allow its end-tag to be omitted.
> Richard> That won't let you reconstruct omitted start tags (such as
> Richard> <HTML>, <HEAD>, and <BODY>), but it _will_ let you reconstruct
> Richard> a well-bracketed tree, so you can deal with valid HTML.
> 
> I support this approach.  The HTML DTD gives a clear list of what
> ending tags can be omitted, and what elements can be within other
> elements.  A proper browser implements these rules properly so that
> well-formed HTML (which may omit closing tags) can be parsed.
> 
> Note that the "closing tags may be omitted" mess of HTML makes HTML
> harder to parse, although not impossible to parse.  That's why XML got
> rid of this, forcing all closing tags to be present.
> 
> I don't think you need to put *error*-correcting into any early
> release of Scamper.  But knowing about omitted end tags is not
> error-correcting, it's parsing legal HTML!
> 
The whole parser is basically one method. Take a look at
HtmlParser class>>parseTokens:

parseTokens: tokenStream
	| entityStack document head token matchesAnything entity body |
	entityStack _ OrderedCollection new.
	"set up initial stack"
	document _ HtmlDocument new.
	entityStack add: document.
	head _ HtmlHead new.
	document addEntity: head.
	entityStack add: head.
	"go through the tokens, one by one"
	[token _ tokenStream next.
	token = nil]
		whileFalse: [(token isTag
					and: [token isNegated])
				ifTrue: ["a negated token"
					(token name ~= 'html'
							and: [token name ~= 'body'])
						ifTrue: ["see if it matches anything in the stack"
							matchesAnything _ (entityStack
										detect: [:e | e tagName = token name]
										ifNone: []) isNil not.
							matchesAnything
								ifTrue: ["pop the stack until we find the right 
									one "
									[entityStack last tagName ~= token name]
										whileTrue: [entityStack removeLast].
									entityStack removeLast]]]
				ifFalse: ["not a negated token. it makes its own entity"
					token isComment
						ifTrue: [entity _ HtmlCommentEntity new initializeWithText: token source].
					token isText
						ifTrue: [entity _ HtmlTextEntity new text: token text.
							((entityStack last shouldContain: entity) not
									and: [token source isAllSeparators])
								ifTrue: ["blank text may never cause the stack 
									to back up"
									entity _ HtmlCommentEntity new initializeWithText: token source]].
					token isTag
						ifTrue: [entity _ token entityFor.
							entity = nil
								ifTrue: [entity _ HtmlCommentEntity new initializeWithText: token source]].
					token name = 'body'
						ifTrue: [body
								ifNotNil: [document removeEntity: body].
							body _ HtmlBody new initialize: token.
							document addEntity: body.
							entityStack add: body].
					entity = nil
						ifTrue: [self error: 'could not deal with this token'].
					entity isComment
						ifTrue: ["just stick it anywhere"
							entityStack last addEntity: entity]
						ifFalse: ["only put it in something that is valid"
							[entityStack last mayContain: entity]
								whileFalse: [entityStack removeLast].
							"if we have left the head, create a body"
							(entityStack size < 2
									and: [body isNil])
								ifTrue: [body _ HtmlBody new.
									document addEntity: body.
									entityStack add: body].
							"add the entity"
							entityStack last addEntity: entity.
							entityStack addLast: entity]]].
	body == nil
		ifTrue: ["add an empty body"
			body _ HtmlBody new.
			document addEntity: body].
	document parsingFinished.
	^ document

It's not impossible to modify this to deal with omitted end tags. Each tag
knows which tokens it can contain. The way it's done now, if I understand it
right, is that each branch is trimmed back from the end, looking for negated
tokens (end tags).
If this were instead done from the beginning of the branch, checking whether
a token is negated or whether it can be contained, it should be able to deal
with missing end tags.
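
As a rough illustration of the rule Richard describes (close and pop
elements whose end tags are optional until one accepts the new item), here
is a minimal standalone sketch. It is written in Python rather than Squeak
so that it runs without the HtmlParser classes. Everything in it is an
assumption made for the example: the ALLOWED and OPTIONAL_END tables are a
made-up subset of the HTML content model, and build_tree and its
(kind, value) token format are hypothetical names, not part of HtmlParser.

```python
# Hypothetical content model for a tiny subset of HTML. These tables are
# illustrative only; they are not taken from HtmlParser or the HTML DTD.
ALLOWED = {
    "body": {"table", "p", "#text"},
    "table": {"tr"},
    "tr": {"td"},
    "td": {"table", "p", "#text"},
    "p": {"#text"},
}
# Elements whose end tag may be omitted (assumed subset).
OPTIONAL_END = {"tr", "td", "p"}


def build_tree(tokens):
    """Build a nested (name, children) tree from (kind, value) tokens.

    kind is "start", "end" or "text". Text is stored as a plain string
    in the children list of the element that contains it.
    """
    root = ("body", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "end":
            # Ignore an end tag that matches nothing open; never pop the root.
            if value in [name for name, _ in stack[1:]]:
                while stack.pop()[0] != value:
                    pass
            continue
        name = "#text" if kind == "text" else value
        # Implicitly close elements that cannot hold the new item, but
        # only while their end tags are optional.
        while (name not in ALLOWED[stack[-1][0]]
               and stack[-1][0] in OPTIONAL_END):
            stack.pop()
        if name not in ALLOWED[stack[-1][0]]:
            raise ValueError(f"{name} is not allowed inside {stack[-1][0]}")
        if kind == "text":
            stack[-1][1].append(value)
        else:
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)
    return root


# <table><tr><td>a<td>b<tr><td>c</table> with no </td> or </tr> at all
tokens = [
    ("start", "table"), ("start", "tr"), ("start", "td"), ("text", "a"),
    ("start", "td"), ("text", "b"),
    ("start", "tr"), ("start", "td"), ("text", "c"),
    ("end", "table"),
]
print(build_tree(tokens))
```

In this sketch the second <td> closes the first one, and the second <tr>
closes both the open <td> and the first <tr>, so the table ends up with two
rows even though no </td> or </tr> appears in the input. An item that is not
allowed and cannot be reached by closing optional elements (text directly
inside <table>, say) is reported as an error rather than corrected.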

Karl




More information about the Squeak-dev mailing list