[Newbies] htmlcssparser package/discovering size of objects
tblanchard at mac.com
Wed Jul 30 21:00:33 UTC 2008
Well, as the perpetrator of that bit of hackery, I can certainly
explain why it gets broken if you let the head object go away.
A node knows its parent through a weak reference, and its offset/length
in the original parsed string. The top object owns the parsed string.
When a node tries to print itself it traverses the parents to get the
original text buffer and then takes the appropriate substring out of
it and prints that.
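
For anyone who wants to see the shape of that arrangement outside Squeak, here is a rough sketch in Python (not Smalltalk, and not the package's actual code). The class name Node, its fields, and the methods parent, root_text and source are all made up for illustration; the only things taken from the description above are the weak parent link, the offset/length pair, and the root owning the text.

```python
import weakref


class Node:
    """Hypothetical stand-in for a parsed node (not the package's real class)."""

    def __init__(self, start, length, parent=None, text=None):
        self.start = start          # offset into the original parsed string
        self.length = length        # number of characters this node covers
        self.text = text            # only the root node holds the buffer
        # The parent is held weakly, so a child does not keep it alive.
        self._parent = weakref.ref(parent) if parent is not None else None

    def parent(self):
        """Return the parent node, or None if there is none or it was collected."""
        return self._parent() if self._parent is not None else None

    def root_text(self):
        """Walk up the parent chain until a node that owns the buffer is found."""
        node = self
        while node.text is None:
            node = node.parent()
            if node is None:
                raise ReferenceError("root was collected; text buffer is gone")
        return node.text

    def source(self):
        """Return the hunk of the original text this node represents."""
        text = self.root_text()
        return text[self.start:self.start + self.length]
```

With a root built over "<p>hi</p>" and a child covering offset 3, length 2, calling source() on the child gives back "hi" for as long as the root is still referenced. Once the last strong reference to the root is dropped, the same call fails, which is the breakage described in the original question.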
This was really useful during debugging since I could see exactly what
hunk of text each node thought it represented (especially since the
nodes parse themselves). Reprinting the document should reproduce the
original text buffer or something is wrong somewhere. So that makes
for a cheap and cheerful integrity check.
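
A minimal sketch of that kind of check, again in Python rather than Smalltalk. It assumes (my simplification, not something the package guarantees) that the pieces being reprinted can be listed as (offset, length) pairs that are meant to cover the buffer in order; the function names are invented.

```python
def reprint(text, spans):
    """Concatenate the substrings named by (offset, length) spans, in order."""
    return "".join(text[offset:offset + length] for offset, length in spans)


def spans_are_consistent(text, spans):
    """True when reprinting the spans reproduces the original buffer exactly."""
    return reprint(text, spans) == text
```

If one node records the wrong length, the reprinted text no longer matches the buffer and the check fails, which points at a parsing bug somewhere.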
Anyhow, making the parent weak was perhaps not a great choice, but it
was meant to make some DOM editing operations easier.
Two fixes/workarounds. Either never let go of the root, or change the
parent code in parsed node to use strong references. It amounts to
the same thing.
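
To illustrate why the two workarounds come to the same thing, here is a small Python sketch (an analogy only; the class and function names are hypothetical and do not correspond to anything in the package). One variant holds the parent weakly, the other strongly, and the helper functions either drop or keep the root.

```python
import weakref


class WeakParentNode:
    """Child holds its parent weakly, as in the design described above."""

    def __init__(self, name, parent=None):
        self.name = name
        self._parent = weakref.ref(parent) if parent is not None else None

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None


class StrongParentNode:
    """Child holds its parent strongly, so the chain up to the root stays alive."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def build_and_drop_root(node_class):
    """Build root and child, then return only the child (the root is let go)."""
    root = node_class("root")
    return node_class("child", parent=root)


def build_and_keep_root(node_class):
    """Workaround 1: hand back the root together with the child."""
    root = node_class("root")
    return root, node_class("child", parent=root)
```

With WeakParentNode, the child from build_and_drop_root loses its parent as soon as the root is collected, while build_and_keep_root keeps it reachable. With StrongParentNode, the child itself keeps the root alive, so either helper works. In both fixes the whole tree stays in memory for as long as any node of interest does.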
On Jul 30, 2008, at 7:38 AM, Marcin Tustin wrote:
> Hello everyone, a slightly involved and multi-part question:
> I'm using the package at http://www.squeaksource.com/htmlcssparser
> (HTML/CSS Parser, or "the parser") to scrape multiple pages (in fact
> about two or three a day, and about a thousand existing pages), so I
> can extract parts of them to put into an rss feed. If I let the root
> object for a parse (the Validator's dom object) be garbage
> collected, none of the rest of the parse tree really works (because
> then other objects only referred to weakly get collected, AFAICT).
> So, my first question is whether there's a way to assess what kind
> of memory overhead there would be for keeping each of these objects
> hanging around indefinitely.
> My second is whether anyone has any advice for another way to do it
> - by using a different parser, or by copying the data into a
> different structure somehow, or something else.