Mark Guzdial guzdial@cc.gatech.edu wrote:
- The <> check currently works by doing an initial scan to find all ranges of <> pairs, and then checks at each line end whether the current text position falls within one of those ranges. If there are 20 HTML tags on this page, then this means going through 20 calls to between:and: and 20 block invocations AT EACH LINE END. A better way is simply to keep a flag which reflects whether the current position is within a <> pair or not; seeing a < turns it on, and seeing a > turns it off; the check at each end of line then becomes extremely cheap.
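The flag idea can be sketched in a few lines. This is only an illustration in Python, not the Squeak code under discussion: the function name line_ends_inside_tag is made up, and it assumes that a "line end" is simply a newline character, whereas the real text composer decides line ends itself.

```python
def line_ends_inside_tag(text):
    """For each newline in text, report whether it falls inside a <...> pair.

    Assumption: line ends are newline characters; no pre-scan of tag
    ranges is done, only a single running flag is kept.
    """
    inside = False          # True while we are between a '<' and the next '>'
    results = []
    for ch in text:
        if ch == "<":
            inside = True   # seeing '<' turns the flag on
        elif ch == ">":
            inside = False  # seeing '>' turns the flag off
        elif ch == "\n":
            results.append(inside)  # the per-line check is one flag read
    return results
```

Each line end costs a single flag read, regardless of how many tags the page contains.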
Hmm, I just wrote a tiny-and-still-incomplete HTML tag scanner for my class as a demonstration (http://www.cc.gatech.edu/classes/cs2390_99_spring/slides/parse/outline.html). Maybe I can modify that for this purpose. A hand-built scanner will probably be faster than a regular expression system.
To help write scanners by hand, there is an indexOfAnyOf: primitive in the standard VM. This method is just like indexOf: except that you can specify a set of characters to look for instead of just one character. It is more limited than scanning with regular expressions, but it turns out to be sufficient in most cases (most computer languages don't seem to have tokens that are all THAT complicated). There are a couple of examples of such scanners already in the system: HtmlTokenizer and MailAddressTokenizer.
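As a rough illustration of that style of scanner, here is a sketch in Python rather than Squeak. The helper index_of_any_of is a hypothetical stand-in for the idea behind indexOfAnyOf: (it returns -1 when nothing is found, which is a choice made for this sketch), and tokenize is not how HtmlTokenizer or MailAddressTokenizer are actually written.

```python
def index_of_any_of(text, chars, start=0):
    """Return the index of the first character at or after start that is
    in chars, or -1 if there is none. Hypothetical helper for this sketch."""
    for i in range(start, len(text)):
        if text[i] in chars:
            return i
    return -1


def tokenize(text, delimiters="<>"):
    """Split text into delimiter characters and the runs between them."""
    tokens = []
    pos = 0
    while pos < len(text):
        nxt = index_of_any_of(text, delimiters, pos)
        if nxt == -1:
            tokens.append(text[pos:])       # no more delimiters: rest is one run
            break
        if nxt > pos:
            tokens.append(text[pos:nxt])    # run of ordinary characters
        tokens.append(text[nxt])            # the delimiter itself
        pos = nxt + 1
    return tokens
```

The whole scanner is one loop that jumps from delimiter to delimiter, which is about all that is needed for tokens this simple.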
Lex
squeak-dev@lists.squeakfoundation.org