[Newbies] htmlcssparser package/discovering size of objects

Wed Jul 30 21:00:33 UTC 2008

Well, as the perpetrator of that bit of hackery, I can certainly  
explain why it gets broken if you let the head object go away.

A node knows its parent through a weak reference, and its
offset/length in the original parsed string.  The top object owns the
parsed string.

When a node tries to print itself it traverses the parents to get the  
original text buffer and then takes the appropriate substring out of  
it and prints that.
This was really useful during debugging since I could see exactly what  
hunk of text each node thought it represented (especially since the  
nodes parse themselves).  Reprinting the document should reproduce the  
original text buffer or something is wrong somewhere.  So that makes  
for a cheap and cheerful integrity check.
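
As a rough illustration of the scheme described above, here is a minimal
sketch in Python rather than Squeak.  The Node class, its method names and
the sample text are all made up for the example and are not the package's
actual code; the only things taken from the description are the weak
parent link, the offset/length pair, and the root owning the text.

```python
import weakref


class Node:
    """Hypothetical stand-in for a parsed node; not the package's real class."""

    def __init__(self, offset, length, parent=None, text=None):
        self.offset = offset
        self.length = length
        self.text = text  # only the root owns the parsed string
        self.children = []
        # The parent is held weakly, so a child never keeps its parent alive.
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    def parent(self):
        return self._parent() if self._parent is not None else None

    def buffer(self):
        """Walk up the parent chain to the root and return its text."""
        node = self
        while node.text is None:
            node = node.parent()
            if node is None:
                raise RuntimeError("root was collected; text buffer is gone")
        return node.text

    def source(self):
        """The exact substring this node believes it represents."""
        return self.buffer()[self.offset:self.offset + self.length]


text = "<p>hi</p><b>yo</b>"
root = Node(0, len(text), text=text)
p = Node(0, 9, parent=root)
b = Node(9, 9, parent=root)

# Integrity check: reprinting the children should reproduce the buffer.
reprinted = "".join(child.source() for child in root.children)
assert reprinted == text
```

If the offsets were off by even one character, the final assertion would
fail, which is the whole point of the check.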

Anyhow, making the parent weak was perhaps not a great choice, but it
was meant to make some DOM editing operations easier in the future
(anticipating possible JavaScript integration).

There are two fixes/workarounds.  Either never let go of the root, or
change the parent code in parsed node to use strong references.  It
amounts to the same thing: the root stays reachable for as long as you
are using the nodes.
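
To make the failure and the two workarounds concrete, here is a small
Python sketch (again not Squeak, and not the package's code).  Root,
WeakChild, StrongChild and dropped_root_breaks are hypothetical names
invented for the example.

```python
import gc
import weakref


class Root:
    """Hypothetical top object that owns the parsed string."""

    def __init__(self, text):
        self.text = text


class WeakChild:
    """Node that holds its parent weakly, as described in this post."""

    def __init__(self, root, offset, length):
        self._parent = weakref.ref(root)
        self.offset, self.length = offset, length

    def source(self):
        root = self._parent()
        if root is None:
            raise RuntimeError("root was collected")
        return root.text[self.offset:self.offset + self.length]


class StrongChild(WeakChild):
    """Workaround 2: also keep an ordinary (strong) reference to the parent."""

    def __init__(self, root, offset, length):
        super().__init__(root, offset, length)
        self._strong_parent = root  # keeps the root alive while the node lives


def dropped_root_breaks(child_class):
    """Return True if a node stops working once the caller drops the root."""
    root = Root("<p>hi</p>")
    child = child_class(root, 3, 2)
    del root
    gc.collect()
    try:
        child.source()
        return False
    except RuntimeError:
        return True


weak_breaks = dropped_root_breaks(WeakChild)
strong_breaks = dropped_root_breaks(StrongChild)

# Workaround 1: simply hold on to the root while its nodes are in use.
kept_root = Root("<p>hi</p>")
kept_child = WeakChild(kept_root, 3, 2)
kept_source = kept_child.source()
```

Either way the cost is the same: the whole parsed string stays in memory
for as long as any node from that parse is still being used.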

On Jul 30, 2008, at 7:38 AM, Marcin Tustin wrote:

> Hello everyone, a slightly involved and multi-part question:
> I'm using the package at http://www.squeaksource.com/htmlcssparser  
> (HTML/CSS Parser, or "the parser") to scrape multiple pages (in fact  
> about two or three a day, and about a thousand existing pages), so I  
> can extract parts of them to put into an rss feed. If I let the root  
> object for a parse (the Validator's dom object) be garbage  
> collected, none of the rest of the parse tree really works (because  
> then other objects only referred to weakly get collected, AFAICT).
>
> So, my first question is whether there's a way to assess what kind  
> of memory overhead there would be for keeping each of these objects  
> hanging around indefinitely.
> My second is whether anyone has any advice for another way to do it  
> - by using a different parser, or by copying the data into a  
> different structure somehow, or something else.
