OCR to recover an old MS?

PJS · January 31, 2016, 8:49pm

Any suggestions for an OCR program?

I have an old print ms — 250 pages — which I’d like to load into the iMac so I can work on it, and page-by-page re-typing promises to be arduous and long.

It’s double-spaced TNR, probably not the easiest to scan but so common (back then) it might have been worked out early in the game.

Thanks for any input.

Phil

xiamenese · February 1, 2016, 11:15am

Hi Phil,

Unless your original pages are very foxed, breaking up, creased, discoloured, etc. OCR should work well, and I don’t think TNR would constitute a problem. But if there are any markings of any kind, they are likely to degrade the output at that point. There are lots of OCR apps out there; I have 2.

1 PDFpen Pro. This is expensive, but it does everything you might need with PDFs. It will also import directly from a scanner and automatically run OCR on it. Generally, its results are good, though there are of course problems around the inner margins when it’s a book that has been scanned.

2 I also have PDF OCR X, the “Community Edition”, which is free in the Mac App Store. The “Community Edition” is limited to working on one page at a time, though there is an “Enterprise Edition”—paid for or upgradable to—which removes this limit. I have only used it occasionally. I have it because, interestingly, it can OCR Chinese, where the makers of PDFpen (Pro) say it would be far too expensive to add Chinese to their list of languages.

HTH

Mr X

Hugh · February 1, 2016, 2:49pm

Phil, success with OCR depends not only on the OCR software but also on the quality of the scan, and therefore the quality of the scanner. For “home” users, I think it’s fair to say that the standard has been set by the Fujitsu ScanSnap iX500, which can prepare clear, precise scans quickly (assuming the original consists of single sheets of A4, foolscap or similar). This machine comes bundled with (as well as other useful applications) a “free” version of ABBYY FineReader for the Mac OCR software, which has worked well for me for several years, including with pieces written in TNR. But at a smidgen under £500 the iX500 is an expensive choice, and definitely not worth it for a single project, although if you were to borrow an iX500 attached to a computer loaded with ScanSnap and ABBYY software you could probably complete the task of scanning and OCR’ing your 250 pages within 120 minutes. A standalone version of ABBYY FineReader Pro for the Mac costs between £80 and £90 (as far as I remember). ExactScan Pro, which also has OCR software, costs around £80. (If the MS is already bound, you’ll either have to guillotine off the binding or use a book scanner, a type of which I have no experience, but which I believe involves a higher level of cost and complexity.)

P.S. If you can’t get your hands on an iX500 or similar sheet-fed scanner, or the MS is in book form, you could use an iPad with scanning software (of which there are numerous examples). I’ve had recent experience of this; for 250 pages it would be a long and time-consuming process, and the images would still need OCR’ing, probably on a Mac or PC. One tip I found useful: when scanning, use a piece of thick, heavy glass to place on the MS and keep the pages as flat as possible. Some iOS apps can adjust for page curvature, but not in all cases.

HTH
H

PJS · February 1, 2016, 5:06pm

H and X —

Thank you both for your (as always) thoughtful and reasonable suggestions. Much appreciated, but a complete re-do seems inevitable.

I did scan a couple test pages, with good clean images, and convert them. Then I re-read the original work. Conclusion: Balance the cost of good scanning and the time involved against the nuisance of re-typing, then toss on one side what considerable work the old (30 years +/-) story needs, and re-typing seems the better course.

It was a short novel at about 55,000 words. It probably will boil down to a 40,000 word novella. I flung words about quite recklessly back in my innocent middle age.

And again, thank you. It did help. (IDH?)

Phil

Jaysen · February 1, 2016, 5:45pm

umm… why not dictate?

I use the OSX feature with not too much trouble. You just speak punctuation…

PJS · February 1, 2016, 7:22pm

Jaysen… funny you should mention it.

Lady of the House made a similar suggestion about an hour ago, which is why I have an answer ready.

First, the reason I didn’t go right away to dictation: I’ve had bad luck with it in the past, finding, in the finished piece, enough confusions and mistakes to outweigh the advantages. By the time I’d isolated and corrected the problems, I’d put in more time than if I’d typed in the first place.

Second, what I’m doing now at the urging of LOTH: I’m trying dictation again, being a bit more careful with elocution, and taking trouble to interject punctuation marks accurately. Considering that the output still needs cleaning up and editing, it looks so far like a toss-up.

I’ll keep at it a while. Maybe I’ll be converted. There’s an uncomfortable sense of talking to myself about dictating, but it might be argued that typing is the same thing from a different part of the brain. Big difference seems to be that my fingers — long accustomed to the chore — make no complaints about overuse, but so much talking — guarded and precise — wears out the vocal cords pretty fast.

Still, thanks for the idea.

Phil

Jaysen · February 1, 2016, 8:28pm

it took me about 2 hours to get dictation habits figured out.

don’t speak too slowly.
speak punctuation.
don’t speak edits. Pain and suffering, the world of Mr K, results.
do NOT wait to edit. (read to end, review)

I figure for re-input I save about 30% of the time of typing.

rontarrant · February 1, 2016, 9:04pm

I’ve done this.

I scanned all the pages, then used Acrobat to export them as a Word file. Worked as well as any other OCR software out there and it’s dirt cheap. Rent it for a month @ $15 U.S. or it might work with the evaluation version, IDK.

But do all the scanning before you get Acrobat. Save the images in PNG format for best results (but JPEG will do in a pinch).

gr · February 2, 2016, 5:17am

Put that old draft up in the attic. I and your agent want to find and publish it later when you are deep in your dotage. Tentative title: ‘Go Set a Typescript’. But that could be tweaked a bit when we learn what your story is actually about.

gr

scokar · February 2, 2016, 7:07am

Find a local copyshop that will scan to PDF. Or possibly free at your library. Then use one of the many OCR apps suggested.

AndreasE · February 2, 2016, 8:42am

It should be mentioned that Mr. Ken Follett, one of the most successful authors on this planet (and fully equipped computer-wise) swears by retyping manually the first version of his manuscript as a means to improve it. In fact, he considers this to be one of the key factors of his success, along with profound research.

Of course, every author is different. But a successful OCR solution might be counter-productive.

Hugh · February 2, 2016, 1:00pm

Yes, as E.B. White, author of “Charlotte’s Web” and “Stuart Little”, and co-author of the bible of American writers “The Elements of Style”, stated: “The best writing is re-writing”. (Lots of folk, including me in the past, have attributed this to Ernest Hemingway. But apparently he wrote “The only kind of writing is re-writing,” which it seems to me is a slightly different proposition.)

xiamenese · February 2, 2016, 3:12pm

On re-typing vs OCRing, I can see how retyping would prompt the writer to edit and (hopefully) improve the text on the way, but I can see that causing focus problems resulting from editing while entering. Or, if you type in the whole thing without editing, you may remember some of the changes you want to make, but most of the details will be lost by the time you get to the end.

The thing about OCR is that, at least in my experience, it is never perfect, and needs very careful reading through and editing anyway—I’m driven loco by the appalling editing of published books that have been scanned and OCR’d to turn them into e-books, with resulting misspellings, substitutions of upper case ‘I’ for lower case “l”, garbled ligatures, commas read as full-stops … you name it! So when you’ve OCR’d the text, you need to read through it very closely, and you can do your edits to improve it as you go.
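As a rough illustration of that cleanup pass, here is a minimal Python sketch. It assumes the OCR output is already plain text; the function names and the "capital I between lowercase letters" heuristic are made up for this example and are not taken from any of the apps mentioned in this thread. It expands the common Unicode ligature characters and flags lines for a human to check, rather than trying to correct them automatically.

```python
import re

# Unicode ligature characters that can turn up in OCR output,
# mapped to the plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Assumed heuristic: a capital I sandwiched between lowercase letters
# is probably a misread lowercase l (e.g. "heIp" for "help").
SUSPECT_I = re.compile(r"(?<=[a-z])I(?=[a-z])")


def expand_ligatures(text):
    """Replace single-character ligatures with ordinary letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def flag_suspects(text):
    """Return (line_number, line) pairs for lines worth a second look."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if SUSPECT_I.search(line)
    ]
```

A script like this only narrows down where to look; commas read as full stops and similar substitutions still need the close read-through described above.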

That’s my view anyway.

Mr X

PJS · February 8, 2016, 7:29pm

Thanks to all for your comments and suggestions. I tried each of the proposed systems. The only conclusion, for me, is that there is no one best way to do this. Best will vary with situation and writer.

The one working best for me in this instance is OSX speech-to-text. It’s faster than re-typing, and cheaper and less complex than OCR conversions with their necessary scanning, and slow enough that I can correct some of the more immediate mistakes en route.

It also, alas, can produce fascinating reconstructions of ordinary English prose, but I’m able to sort it out along the way.

Phil

Jaysen · February 8, 2016, 7:39pm

Wait. That sounds oddly like a suggestion I made.

Dang it all! Now I’m being useful. Vic-k may throw me out of the club again.

PJS · February 8, 2016, 8:13pm

Son-of-a-gun. Jaysen, I am sorry. I know how Vic can be in these things. Maybe if we both tell him that I didn’t really use your idea, I only said that to get you in trouble and embarrass him when he called you out on it. Then later we can agree that, no, I did use your idea, and it was that which was meant to embarrass him. So it’s a lose-lose for him.

Maybe a lose-lose-lose.

Phil

vic-k · February 8, 2016, 9:18pm

Hey!! Bring it on!! I can handle it. I’m a born Loser, so… tough titty! pfffrrrrtttt!!!

PJS · February 9, 2016, 2:41am

… so if I read that correctly, you’re saying you’ll be part of the next U.S. Republican Presidential debate. Right?

ps

vic-k · February 9, 2016, 11:16am

‘ey c’mon! f’ feck sake! Gimme a break. There are losers … and there are LOSERS!
jeeez … wot y’ like!? tch!tch!

Hugh · February 9, 2016, 12:09pm