Scanning Books (was Re: (no subject))

Jason Burke jason at squeak-mentors.org
Thu Nov 24 16:09:07 UTC 2005


Hi Marcus,

I totally agree with you on this. Collecting scanned pages into a PDF is
much easier than OCRing the text. However, I'm specifically talking
about OCRing one book ("The Art and Science of Smalltalk") because it has
errors which will probably confuse the target audience (Smalltalk
newbies, like myself). I was able to spot the errors, but they did add
some confusion because the majority of them were in variable and
method names, which tend to act as documentation for the code
snippets.

There's no doubt that OCRing the book will have a great deal of
overhead, and if I can, my preference is to just rescan the pages with
errors. However, I may go through the process below just to see how
effective GOCR is (to satisfy my personal curiosity and possibly save
me from having to cut the binding on my copy of the book for scanning).
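
For what it's worth, the process quoted below (convert each scan with
netpbm, then run GOCR on the result) could be sketched roughly like this
in Python. It is only a sketch under assumptions: the file names and the
"work" directory are made up, I'm assuming netpbm's anytopnm recognizes
the scan format and writes PNM to standard output, and I'm assuming GOCR
accepts that PNM via -i and writes text via -o. If GOCR turns out to need
a strictly bilevel .pbm, an extra netpbm thresholding step would be needed.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(scan, out_dir):
    """Build (but do not run) the two commands for one scanned page.

    Returns (convert_cmd, pnm_path, ocr_cmd). The paths are hypothetical;
    nothing here checks that the tools are actually installed.
    """
    scan = Path(scan)
    out_dir = Path(out_dir)
    pnm_path = out_dir / (scan.stem + ".pnm")
    txt_path = out_dir / (scan.stem + ".txt")
    # Assumption: anytopnm writes the converted image to stdout.
    convert_cmd = ["anytopnm", str(scan)]
    # Assumption: gocr reads its input with -i and writes text with -o.
    ocr_cmd = ["gocr", "-i", str(pnm_path), "-o", str(txt_path)]
    return convert_cmd, pnm_path, ocr_cmd


def ocr_page(scan, out_dir):
    """Convert one scan to PNM, then OCR it to a text file."""
    convert_cmd, pnm_path, ocr_cmd = build_ocr_commands(scan, out_dir)
    pnm_path.parent.mkdir(parents=True, exist_ok=True)
    with open(pnm_path, "wb") as pnm_file:
        subprocess.run(convert_cmd, stdout=pnm_file, check=True)
    subprocess.run(ocr_cmd, check=True)


if __name__ == "__main__":
    # Hypothetical layout: scanned pages live in ./scans as TIFF files.
    for page in sorted(Path("scans").glob("*.tif")):
        ocr_page(page, "work")
```

The resulting text files would still need the proofreading pass, since
the point is to get the variable and method names right.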

Thanks for the input,

Jason


On Thu, 2005-11-24 at 09:14 -0300, Marcus Denker wrote:
> Hi,
> 
> The question is whether we really want to OCR to a formatted (Word, whatever)
> document... For the last number of books added, I just did (or
> better: asked the company to do) a PDF that has the scanned pictures
> with the OCRed text layered invisibly in front.
> 
> The nice thing with that is: you can search and do copy+paste, but get
> the original pictures of the pages. No proofreading is needed, as small
> errors in the OCR don't hurt. It seems to be what all the projects do that
> scan huge numbers of books (Million Book Project, Google, Amazon...).
> 
> For the format we used PDF; we want to explore DjVu in the future (it's
> much more compact).
> 
> Doing a real reformatting is nice, of course, but the time needed to do
> that is quite large, I think.
> 
> On 24.11.2005, at 04:04, Jason Burke wrote:
> 
> > I've done some research and I think I have a way to OCR the currently
> > existing scanned pages without having to rescan everything. There's a
> > project out there called GOCR that takes a .pbm file from the command
> > line and OCRs it to a text file. Add the netpbm app, which converts just
> > about any file format to a .pbm file, and I should be able to get the
> > current scans into the correct file format (if necessary). After that,
> > with some proofreading, cleaning up, and a bit of reformatting, we should
> > have it in a Word (or OpenOffice) doc that can be used as a master for
> > correcting mistakes and creating the PDFs from.
> >
> >
> > I just got my copy of the book in the mail, so I can correct errors now.
> > Let me know how you want to transfer the files you have.
> >
> >
> > Jason
> >
> >
> >
> > > OK, keep me informed. After digging through my archive, I can tell you that
> > > I have all the scanned files.
> > >
> > > Stef
> > >
> > > On 23 nov. 05, at 21:56, Jason B Burke wrote:
> > >
> > >
> > > Hello Stef,
> > >
> > > Hmm, I thought this text was OCR'd to begin with. Do you have the
> > > individual pages scanned, or is it only in PDF format right now? If
> > > the individual pages are scanned then I might want them (maybe I'll
> > > just try to get a better scan of the pages with errors). I already
> > > have the PDF, so I don't need another copy of that (maybe I'll put it
> > > through Acrobat Distiller here at work to see what I come up with).
> > >
> > > Ultimately, I'm still waiting for my copy to be delivered from the
> > > bookstore, and I won't be able to start until I get that. However, when
> > > it gets here I'll get in touch with you, and we'll figure this out. I
> > > really want to see a good copy of this out there since this is a great
> > > book (best one for beginners that I've seen).
> > >
> > > Thanks,
> > >
> > > Jason
> > >
> > >
> > >
> > > stéphane ducasse <ducasse at iam.unibe.ch>
> > > Sent by: squeak-dev-bounces at lists.squeakfoundation.org
> > > 11/23/2005 02:20 PM
> > > Please respond to The general-purpose Squeak developers list
> > >
> > >
> > > To: The general-purpose Squeak developers list
> > > <squeak-dev at lists.squeakfoundation.org>
> > > cc: (bcc: Jason B Burke/LAKE/CHMS/CONTRACTOR)
> > > Subject: Re: Newbie Squeaker Introduction
> > >
> > >
> > >
> > > Hi Jason,
> > >
> > > welcome :)
> > > If you want to fix the OCR of "The Art and Science of Smalltalk", tell
> > > me what I can do to help you. Do you want the scanned version?
> > > Do you have an OCR tool?
> > >
> > > Stef
> > >
> > >
> >
> >
> >
> 
> 
> 



