I recently found some old “publications” that I had produced as PDFs, but the original source files (not Scrivener) are long gone. Recent versions of macOS and iOS/iPadOS can extract text from an image or from a PDF, and I essentially need both. At the moment I have the running text extracted from the PDF files, but the tabular material in there sometimes comes out strangely.
Each entry in the table is supposed to be
Code1 Name 1 Description 1
Status 1
Code2 Name 2 Description 2
Status 2
About 50% of the tables come out looking like that, but the other 50% come out as
Code1
Status 1
Code2
Status 2
…
…
…
Code20
Status 20
Description 1
Description 2
…
…
…
Description 20
causing me to have to edit those sections into the desired organisation. I’m wondering if there is some PDF extraction tool that will grab this tabular material and produce the desired layout? Apache’s Tika might do it, but the only instance of it I have is embedded in some linguistic software, which also mangles the tabular material. There seem to be several different PDF reader classes for Python and Ruby. So before I create my own limited tool, I’m wondering if fellow Scrivener users here know of a program that produces a better result than I’m already getting. If you do, please tell me where to find it.
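For what it is worth, a minimal sketch of what such a limited tool might do for the scrambled tables, in Python. It assumes the extracted block always has the shape shown above: alternating code and status lines for every record first, then one description line per record, in the same order. The function name `regroup` is purely illustrative, and real extractions may well break these assumptions (wrapped descriptions, missing lines), so treat it as a starting point only.

```python
def regroup(lines):
    """Rebuild 'Code Description' / 'Status' pairs from a scrambled block.

    Assumed input order (taken from the example above, not guaranteed):
    Code1, Status 1, ..., CodeN, Status N, Description 1, ..., Description N
    """
    # Drop blank lines and surrounding whitespace before counting records.
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected 3 lines per record (code, status, description)")

    n = len(rows) // 3
    head, descriptions = rows[: 2 * n], rows[2 * n :]

    out = []
    for i in range(n):
        code, status = head[2 * i], head[2 * i + 1]
        out.append(f"{code} {descriptions[i]}")  # code line followed by its description
        out.append(status)                       # status stays on its own line
    return out


if __name__ == "__main__":
    scrambled = ["Code1", "Status 1", "Code2", "Status 2",
                 "Description 1", "Description 2"]
    print("\n".join(regroup(scrambled)))
```

The length check is there so that a block which does not fit the assumed pattern fails loudly rather than being silently mis-paired.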
Found something almost useful on an Adobe web site that looks to produce consistent results.
Part of Description Code
Remainder of Description
Status
I can work with that: some time with MacVim to join the three lines of each record together. Although I’m wondering if Scrivener’s search-and-replace with some fancy regexps would do the trick for me. Oh, and I seem to have a better pre-formatted block this time around.
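The three-line join can also be sketched as a regular expression in Python. This assumes every record is exactly three consecutive non-empty lines, as in the layout above; the name `join_records` and the choice of a single space as separator are illustrative, not anything the Adobe tool prescribes.

```python
import re

# Three consecutive non-empty lines form one record.
# MULTILINE makes ^ and $ match at the start and end of each line.
RECORD = re.compile(r"^(.+)\n(.+)\n(.+)$", re.MULTILINE)


def join_records(text, sep=" "):
    """Join each run of three lines into a single line."""
    return RECORD.sub(lambda m: sep.join(g.strip() for g in m.groups()), text)


if __name__ == "__main__":
    sample = (
        "Part of Description Code\n"
        "Remainder of Description\n"
        "Status\n"
    )
    print(join_records(sample))
```

A blank line between records does no harm, since the pattern only matches non-empty lines, but a record that spans two or four lines would throw off everything after it within the same unbroken run of lines, so the result still wants a quick read-through.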
PDF OCR X Community or Enterprise Editions† will do it, preserving the table with its borders. It can also take a JPEG and turn it into a searchable PDF.
† I have both… I think I needed the Enterprise edition for OCRing Chinese.
Mark
Edit: The PDF was printed from Numbers with complex tables with a fair number of merged cells. I copied and pasted the result into Nisus Writer Pro, not Scrivener. For testing the Community Edition, I used Graphic Converter to create a JPEG so that it didn’t have the PDF’s searchable layer. I haven’t tried Scrivener to see how it would cope with such a complex table.
Further edit: for JPEGs, PNGs or non-searchable PDFs where the text is in columns and/or interspersed with images, I use ScanThing, which lets you extract blocks of text within the page, so you have less sorting out to do in the end.
Sadly it only converts the first page, although to be fair it extracted the tabular data as expected. But horror … the interface is rubbish. I suspect that it originated as a Windows program and that no thought was given to making the interface fit with macOS.
Following the implicit hint, I searched for other PDF OCR programs and found one (OCR Scanner) that did grab every part of the tabular data. Again, the interface is not macOS-style; however, as it pulls out everything, I’ll stick with this one.
I think the Enterprise Edition might work on multi-page documents.
As for the interface, that has never worried me … it’s just a self-contained step in a work path. As I said, I got it because it could OCR Chinese, at a time when I was using PDFPen Pro, at greater cost, and when I enquired, Smile said OCRing Chinese would cost far too much.
But if OCR Scanner does the job for you, that’s all that matters.
If one tries to grab more than a single page it yells about the Enterprise Edition. But the cost of around $30 put me off. Paying that much, I would expect close conformance to Apple’s Human Interface Guidelines. And as this is (I hope) a one-off activity, I baulk at the cost.