Copying text from pdf files

donncha77 · May 13, 2008, 12:39pm

Hi
When I copy text from a pdf file (within Scrivener) many if not most of the spaces between words are lost. Does anyone know why this happens or if it can be sorted out?

Donncha

KB · May 13, 2008, 5:00pm

It’s because the text system is trying its best to read text that has already been rendered to PDF, so a lot of the information that the text system requires is no longer there. There’s nothing that can be done about it, I’m afraid.

All the best,
Keith

xiamenese · May 14, 2008, 12:32am

That’s odd, because I’ve just tried it to see what happens … I imported a PDF into the research folder, then copied a couple of paragraphs from that and pasted it into a file in the Draft folder and it copied with all the spaces no problem.

It was a PDF I had created myself by printing to PDF from Nisus, so in case that made any difference, I imported a PDF I’d downloaded into the Research folder and then copied and pasted from that … again no problem, all the spaces there as required.

Perhaps, if the problem persists you could try choosing “Open in alternate editor” from the contextual menu — I use Skim — and see if that does better for you.

I’m using 10.5.2 on Intel, in case that also makes a difference.

Mark

cyberbryce · May 14, 2008, 3:11am

Specifically, I think it’s because PDF is primarily a page layout format: it (often?) does not represent the spaces between words, but just draws the text at positions on the page. Selecting and copying, as done by the OS X routines Scrivener uses, requires guessing where the spaces belong, so it works well for some PDFs or parts of PDFs, but not others.
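To make the guessing concrete, here is a toy sketch in Python of the kind of heuristic a text extractor might use. It is not the actual OS X routine (I don't know how that works internally); the `Glyph` type, the `lay_out` helper and the `gap_ratio` threshold are all made up for illustration. Letters are stored only as positions, and a space is inferred whenever the horizontal gap between two letters is wide enough.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """One drawn character: hypothetical stand-in for what a PDF page stores."""
    char: str
    x: float      # left edge of the character on the page
    width: float  # how wide the character is drawn


def lay_out(text: str, word_gap: float, width: float = 5.0) -> List[Glyph]:
    """Place the letters of `text` left to right.

    A space in `text` is NOT stored as a glyph; it only moves the pen
    by `word_gap`, which is how the space information gets lost.
    """
    glyphs: List[Glyph] = []
    x = 0.0
    for ch in text:
        if ch == " ":
            x += word_gap
        else:
            glyphs.append(Glyph(ch, x, width))
            x += width
    return glyphs


def join_with_inferred_spaces(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a string, guessing a space wherever the gap between two
    neighbouring glyphs exceeds `gap_ratio` times the previous glyph's width."""
    if not glyphs:
        return ""
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > gap_ratio * prev.width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# Generous word spacing: the guess succeeds.
print(join_with_inferred_spaces(lay_out("to be", word_gap=3.0)))  # to be
# Tight word spacing (e.g. justified or condensed text): the space is lost.
print(join_with_inferred_spaces(lay_out("to be", word_gap=1.0)))  # tobe
```

The same threshold gives the right answer for one layout and the wrong one for the other, which would fit the observation that some PDFs copy fine and others lose their spaces.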

Depending on how important this is to you, you have some options. There might be some tool that happens to work better in capturing text than the OS X routines for your particular PDF. Some alternative PDF reading and conversion engines are: Adobe Acrobat, xpdf/pdftotext (foolabs.com/xpdf/), and Intaglio (a drawing program for the Mac). (Since Skim, like Scrivener, uses the built-in routines, I suspect it won’t be of much help…)
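If you want to try pdftotext on a batch of files, a small wrapper like the following might do. This is only a sketch: it assumes the `pdftotext` command-line tool is installed and on your PATH, the function names are my own, and I have used only the `-enc` and `-layout` options. Whether the result is better than the built-in routines for a given PDF is something you would have to test.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def pdftotext_command(pdf: Path, txt: Path, keep_layout: bool = False) -> List[str]:
    """Build the pdftotext command line (hypothetical helper, not part of xpdf)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # ask for UTF-8 output explicitly
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [str(pdf), str(txt)]
    return cmd


def extract_text(pdf: Path, txt: Path, keep_layout: bool = False) -> str:
    """Run pdftotext on `pdf`, write the result to `txt`, and return the text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    subprocess.run(pdftotext_command(pdf, txt, keep_layout), check=True)
    return txt.read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Show the command that would be run for an example file name.
    print(" ".join(pdftotext_command(Path("textbook.pdf"), Path("textbook.txt"))))
```

The resulting text file can then be imported into Scrivener like any other plain-text document.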

Finally, another option is OCR: OCR programs are designed to detect word boundaries, so you could load your PDF into an OCR program and scan it. Adobe offers an online service with a trial membership that does this, if you choose the “make documents searchable” option, createpdf.adobe.com/cgi-feeder.p … percapture .

I have exactly the same problem for a large reference document collection (an electronic textbook), and while I’m sure one of these methods would work, it hasn’t been worth the time…

donncha77 · May 14, 2008, 11:45am

Thank you all for your help. Opening in an external editor (I seem to never think of contextual menus) is a fine workaround.

And thanks to one and all (or perhaps just one!) for a very fine piece of software.

Donncha

d_a_friedman · June 26, 2008, 5:24pm

if one already has pdfs that have been ocr’ed (in my case by scanning them in using a canon mx-700 or by downloading them from an online journals server such as jstor) you can use Skim plus a special template to export the skim highlighted notes into a multimarkdown file that can be imported into scrivener.

i like to have separate note cards/snippets of text for each highlighted quote from Skim. fortunately, Skim already exports the highlighted passages this way, but not in a format you can use for importing into scriv.

the solution is to write your own Skim template for exporting the notes.

the file “notesTemplate.txt” needs to be put in the directory library/application support/skim - which you need to create if you don’t have one already (don’t think Skim creates one).

here is a shot of the template file:

the Skim notes will come out in the format:

, p.

text

this is an mmd compatible format which you can import into Scriv. the result is a lot of snippets of text which serve as notecards for me.

of course, you could read the Skim wiki and mess around with the template to get the text in a different format. i really wanted lots of snippets in skim.

regards,

df

ps thanks to signinstranger on this forum, Christiaan (Skim) and Keith

flow · July 7, 2008, 9:14am

pdftotext is a mac app to convert pdf documents to plain text.