(no subject)

Marcus Denker denker at iam.unibe.ch
Thu Nov 24 12:14:09 UTC 2005


Hi,

The question is whether we really want to OCR to a formatted (Word, whatever)
document... For the last batch of books added, I just made (or better:
asked the company to make) a PDF that has the scanned pictures with the
OCRed text layered invisibly in front.

The nice thing about that is: you can search and copy+paste, but still get
the original pictures of the pages. No proofreading is needed, as small
errors in the OCR don't hurt. It seems to be what all the projects that
scan huge numbers of books do (Million Book Project, Google, Amazon...).

For the format we used PDF; we want to explore DjVu in the future (it's
much more compact).

Doing a real reformatting is nice of course, but the time needed to do
that is quite large, I think.

On 24.11.2005, at 04:04, Jason Burke wrote:

> I've done some research and I think I have a way to OCR the currently
> existing scanned pages without having to rescan everything. There's a
> project out there called GOCR that takes a .pbm file from the command
> line and OCRs it to a text file. Add the netpbm app, which converts just
> about any file format to a .pbm file, and I should be able to get the
> current scans into the correct file format (if necessary). After that,
> with some proofreading, cleaning up, and a bit of reformatting, we should
> have it in a Word (or OpenOffice) doc that can be used as a master for
> correcting mistakes and creating the PDFs from.
>
>
> I just got my copy of the book in the mail, so I can correct errors now.
>
> Let me know how you want to transfer the files you have.
>
>
> Jason
>
>
>
> > OK, keep me informed. After diving into my archive, I can tell you that
> > I have all the scanned files.
> >
> > Stef
> >
> > On 23 nov. 05, at 21:56, Jason B Burke wrote:
> >
> >
> > Hello Stef,
> >
> > Hmm, I thought this text was OCR'd to begin with. Do you have the
> > individual pages scanned, or is it only in PDF format right now? If
> > the individual pages are scanned then I might want them (maybe I'll
> > just try to get a better scan of the pages with errors). I already
> > have the PDF so I don't need another copy of that (maybe I'll put it
> > through Acrobat Distiller here at work to see what I come up with).
> >
> > Ultimately, I'm still waiting for my copy to be delivered from the
> > bookstore, and I won't be able to start until I get that. However,
> > when it gets here I'll get in touch with you, and we'll figure this
> > out. I really want to see a good copy of this out there since this is
> > a great book (best one for beginners that I've seen).
> >
> > Thanks,
> >
> > Jason
> >
> >
> >
> > stéphane ducasse <ducasse at iam.unibe.ch>
> > Sent by: squeak-dev-bounces at lists.squeakfoundation.org
> > 11/23/2005 02:20 PM
> > Please respond to The general-purpose Squeak developers list
> >
> >
> > To: The general-purpose Squeak developers list
> > <squeak-dev at lists.squeakfoundation.org>
> > cc: (bcc: Jason B Burke/LAKE/CHMS/CONTRACTOR)
> > Subject: Re: Newbie Squeaker Introduction
> >
> >
> >
> > Hi Jason
> >
> > welcome :)
> > If you want to fix the OCR of the "Art and Science of Smalltalk", tell
> > me what I can do to help you. Do you want the scanned version?
> > Do you have an OCR tool?
> >
> > Stef
> >
> >
>
>
>
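
The GOCR route Jason describes in the quoted message (convert each scan
with netpbm, then run gocr on the result) could be sketched roughly as
below. This is only an illustration under assumptions that are not in the
thread: that netpbm's anytopnm converter and gocr's -i/-o options are
what gets used, and that the scans sit in a "scans" directory with
results going to "ocr". The function name ocr_commands is hypothetical.
Note that anytopnm may emit PGM or PPM rather than strict PBM, depending
on the input image.

```python
from pathlib import PurePosixPath
import shlex


def ocr_commands(scan, out_dir="ocr"):
    """Build the two shell command lines for one scanned page.

    Step 1 converts the scan to a netpbm image with anytopnm (assumed tool).
    Step 2 runs gocr on that image and writes plain text (assumed -i/-o flags).
    Nothing is executed here; the caller decides how to run the commands.
    """
    scan = PurePosixPath(scan)
    pnm = PurePosixPath(out_dir) / (scan.stem + ".pnm")
    txt = PurePosixPath(out_dir) / (scan.stem + ".txt")
    # shlex.quote protects file names that contain spaces or shell characters.
    convert = f"anytopnm {shlex.quote(str(scan))} > {shlex.quote(str(pnm))}"
    recognize = f"gocr -i {shlex.quote(str(pnm))} -o {shlex.quote(str(txt))}"
    return [convert, recognize]


if __name__ == "__main__":
    # Hypothetical page names; print the commands instead of running them.
    for page in ["scans/page001.tif", "scans/page002.tif"]:
        for line in ocr_commands(page):
            print(line)
```

The text files produced this way would still need the proofreading pass
Jason mentions; the sketch only covers the mechanical conversion step.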