Run external app from squeak

Yanni Chiu yanni at rogers.com
Sun Feb 4 17:58:24 UTC 2007


Sebastian Sastre wrote:
>     The fact is that I want to be able to read and parse some contents in
> PDF files with a 3.9 Squeak running on Linux. I've looked at the PDFReader but I
> can't get access to the text of the PDF with it.

I've used the PDFReader to parse telephone bills (in a prototype).
You can get the text (I can dig out more details, if you're
interested), but the results were highly dependent on the PDFs
themselves.

For example, what would appear on the printed page as a nice
table, with an informational bar on the right, might end up
being fed to the processing logic all mixed together. That is,
you get line 1 from the table, then line 1 from the information
bar, then lines 2 and 3 of the table, because the information
bar's font is bigger. In this case, the content emerges from the
PDF in top-to-bottom order, without regard to logical elements on
the page. In another case (from a different company's invoice),
the behaviour was different: the content came out grouped by
logical elements.
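
To make the interleaving concrete, here is a minimal sketch in
Python (not Squeak, and not PDFReader's actual API). It assumes
each piece of text comes with an (x, y) position, with y measured
down from the top of the page. The fragment data and the name
top_to_bottom are made up for illustration.

```python
# Hypothetical text fragments: (x, y, text), y measured down from the page top.
# The info bar uses a bigger font, so its lines are spaced further apart.
fragments = [
    (50, 100, "table line 1"),
    (400, 100, "info bar line 1"),
    (50, 112, "table line 2"),
    (50, 124, "table line 3"),
    (400, 130, "info bar line 2"),
]


def top_to_bottom(frags):
    """Order fragments by vertical position, then left to right."""
    ordered = sorted(frags, key=lambda f: (f[1], f[0]))
    return [text for _x, _y, text in ordered]


reading_order = top_to_bottom(fragments)
for text in reading_order:
    print(text)
```

Read in this order, the table and the info bar arrive mixed
together, which is the kind of stream the processing logic ends
up seeing.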

So, any approach based on extracting text from a PDF is going
to be something of a guessing game. You'd have to tune your code
for the PDF examples that you can analyze.
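
As a rough idea of what that tuning might look like, here is a
sketch in Python (again, not Squeak or PDFReader code). It
assumes you can get an x position for each piece of text and
that a single vertical split separates the two page elements.
The names group_by_column and split_x, and the sample data, are
hypothetical; split_x is the value you would adjust per layout.

```python
def group_by_column(frags, split_x):
    """Split (x, y, text) fragments at split_x; return each side top to bottom."""
    left = sorted((f for f in frags if f[0] < split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= split_x), key=lambda f: f[1])
    return [f[2] for f in left], [f[2] for f in right]


# Hypothetical fragments in the interleaved order they came out of the PDF.
sample = [
    (50, 100, "table line 1"),
    (400, 100, "info bar line 1"),
    (50, 112, "table line 2"),
    (50, 124, "table line 3"),
    (400, 130, "info bar line 2"),
]

# split_x=300 is a guess that suits this sample layout only.
table, info_bar = group_by_column(sample, split_x=300)
print(table)
print(info_bar)
```

A split that works for one company's bills may well be wrong for
another's, so the threshold has to be chosen from the examples
you actually have.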



