[squeak-dev] Re: [Pharo-project] HTML parser (again)

Andrei Stebakov lispercat at gmail.com
Wed Aug 18 15:01:24 UTC 2010


I need it for web page scraping. The XML parser chokes on malformed HTML input.

On Wed, Aug 18, 2010 at 2:34 AM, laurent laffont
<laurent.laffont at gmail.com> wrote:
>
>
> On Wed, Aug 18, 2010 at 7:50 AM, Andrei Stebakov <lispercat at gmail.com>
> wrote:
>>
>> I've been looking for a nice and fast HTML parser.
>> I've found Zulq Alam's Soup
>> (http://www.squeaksource.com/@vHckXt8_6gVtXFxy/XMrjDbIs). It looks nice,
>> but it's way too slow for me (it takes 5 sec to parse the page; my
>> current Lisp parser takes about 1 sec for the same page).
>> I found another one, Todd Blanchard's HTML and CSS parser
>> (http://www.squeaksource.com/@iMgHmTKVxU00wEdz/A0jkqk71) but I
>> couldn't load it into Pharo 1.1 or Squeak 4.1.
>> It complains about a syntax error and leaves a progress bar that
>> I can't kill...
>> I wonder if anyone (Todd?) can take a look at the parser and figure
>> out how to fix it?
>>
>> What other options do I have for an HTML parser?
>> Looking at Pharo's speed, I wonder if there is any way to optimize it. Is
>> a JIT or some other speed optimization planned for Pharo/Squeak?
>
>
> What do you need to do?
> There's XMLSupport http://www.squeaksource.com/XMLSupport.html
> Scamper might have a standalone HTML
> parser http://www.squeaksource.com/Scamper.html
> The CogVM has JIT.
> Laurent.
>
>>
>> Thank you,
>> Andrei
>>