Suggestions for Work Flow -- paper to pdf to Scrivener?

SmallFish · February 20, 2017, 9:54am

Hello, and apologies for what might be a question that should be posted elsewhere. I am just starting up my first Scrivener project. I have only the smallest computer and software technical vocabulary. And I might just be taking on a lot more than I can handle as a newbie…

I have at least 6 book boxes overstuffed with HANDWRITTEN documents, on a variety of paper sizes, written over a span of about 10 years. Those documents are currently in random order, although each has a date written on it somewhere on the page. I would like to end up with date-sorted electronic documents, and then to be able to select the occasional useful sentence to refer to while writing in Scrivener. At the same time, I want to maintain a complete date-ordered electronic set of the documents somewhere on my computer.

After days of researching options and reading advice, I may have come up with a partial solution. More specific advice, as well as any suggestions at all, will be greatly appreciated!

My best guess as to what to do is:

1. Using my iPad, and ScannerPro, take a picture of each document. File each into a folder labeled for one month out of the 10 years (i.e. 120 folders). I am hoping to finish this step over the next 6 weeks – will it be possible?
2. Somehow get those files onto my MacBook.
3. Reorder the files within each folder into accurate date order.
4. Comb through the newly chronologized e-documents, select the worthy sentences.
5. Using some magic, get a copy of those sentences into Scrivener, while keeping the original of the lengthier compilation intact.
6. Write the thing I want to write, referring to those select sentences.
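
For anyone who would rather generate the 120 monthly folder names than type them, here is a minimal Python sketch. The starting year (2007) and the `YYYY-MM` naming pattern are assumptions made for illustration; the point is only that year-first names sort chronologically on their own.

```python
def month_folder_names(start_year, years=10):
    """Return one 'YYYY-MM' folder name per month, in chronological order."""
    return [
        f"{year:04d}-{month:02d}"
        for year in range(start_year, start_year + years)
        for month in range(1, 13)
    ]

# 2007 is a made-up starting year; substitute the real one.
names = month_folder_names(2007)
print(len(names))            # 120
print(names[0], names[-1])   # 2007-01 2016-12
```

Because the year comes first, sorting these names alphabetically in the Finder also sorts them by date.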

Additional questions I have:

Do I need to convert the handwritten text to a typed and computer-readable format at any time in the work flow? If so, how would I do that? Is it even possible? The USPS seems to have figured this out, but can mere mortals?
Is there any way to scan for the dates and do the sort automatically? I think not, since the dates are handwritten, but one can hope.
What else am I not understanding?

Thanks for whatever assistance you can provide!

Hugh · February 20, 2017, 11:49am

I don’t know enough to respond to all your points, but I have a suggestion - specifically relating to your comment “Using some magic…”

I should try selectively using voice-to-text software, either Apple’s own which your Mac comes equipped with, or Nuance’s Dragon for Mac. The latest version of Dragon for Mac (6.05, if I remember correctly) is by all accounts a buggy and unstable affair (although apparently it works OK with Scrivener - I haven’t tried it), so Apple’s own, perhaps in the enhanced version, will probably be better.

You can manually sort your sheets by date and then go through them one by one, reading into a suitable microphone the date on the sheet and its key sentence. The software will transcribe your words as typed text into whatever writing programme you choose; if it was Scrivener you could then use the programme Search to find the dates you need. Obviously purchase of a microphone appropriate for voice-to-text dictation will be necessary; the website Knowbrainer (http://www.knowbrainer.com) has a guide.

The alternative would involve “optical character recognition” technology - OCR - to turn your sheets into pages that you could then search electronically for the dates and sentences you need (bearing in mind that if you are to search for dates or put the sheets in date order automatically, the software has to be able to read and recognise the handwritten dates and compare them with your typed-in search terms). For this, using Evernote (https://evernote.com) - which does do handwriting-OCR for search purposes - may be best. But I personally wouldn’t go down that route, because at the moment handwriting-OCR technology is in my experience quite a lot less developed and less accurate than voice-to-text.

My suggestion could also reduce or remove the need to carry out the scanning that you write about (which may be desirable because scanning can be a very long and tedious procedure if you have a large number of sheets to process - having recently scanned some 100-year-old historical street records with Scanner Pro, believe me, I know!).

derick · February 20, 2017, 4:36pm

I did something similar with +/- 2800 pages of research field notes. Here are a couple of suggestions.

re: 1) if your budget allows, offload as much of the scanning labor as possible. I hired a graduate student to do most of it for me, and she was working on a high-speed scanner. Even if you can’t hire someone, the scanning process will go much much faster if you can find a university library or copy shop with a high-speed scanner. Scanner Pro is a great app - I use it regularly - but when you’re facing 100s of pages it’s not efficient.

re: 3) sort the paper copies before scanning - at least group them by month. Paper is likely to be way easier to work with than opening files, zooming to the page size, scrolling around to find the date etc. Then scan one file per document or for long docs, try to limit to 3-4 pages with logical breaks.

re: 4) again, this may be more easily done on paper & as Hugh suggests you may find a lot of it doesn’t need scanning.

re: 5 & 6) yes, you’ll need to retype, IF you really need full text, or use the voice-to-text option Hugh describes. But the retyping and selection process are really part of the analysis / writing process & can be time well spent.

Finally consider just indexing digitally and leaving the originals on paper. I’m attaching a handout I’ve given to students describing a basic way to do this in Excel, though I use a Filemaker database set up similarly myself. I actually worked with my field notes for about 12 years with only the paper copies and a digital index, typing up key passages thematically but rarely if ever retyping entire pages. In any case, without knowing the project in question, I’d recommend considering indexing as part of step four – if your interest is not just “worthy sentences” but worthy sentences on topic A vs. topic B vs. topic C, indexing will be a big help.
coding.pdf (340 KB)
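
The same kind of digital index could also be kept as a plain CSV file and queried with a few lines of Python. This is only a sketch under assumptions: the column names (`date`, `box`, `page`, `topics`) and the sample rows are invented for illustration and are not taken from the handout.

```python
import csv
import io

# Hypothetical index: one row per paper page, with semicolon-separated topic codes.
INDEX_CSV = """date,box,page,topics
2009-03-14,2,17,family;travel
2008-11-02,1,4,work
2009-03-15,2,18,travel
"""

def pages_on_topic(index_text, topic):
    """Return (date, box, page) for every indexed page tagged with topic, oldest first."""
    rows = csv.DictReader(io.StringIO(index_text))
    hits = [
        (row["date"], row["box"], row["page"])
        for row in rows
        if topic in row["topics"].split(";")
    ]
    return sorted(hits)

for hit in pages_on_topic(INDEX_CSV, "travel"):
    print(hit)
```

With the originals left on paper, a lookup like this tells you which box and page to pull for topic A versus topic B.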

gr · February 20, 2017, 8:53pm

Scanning with ScannerPro on your iPad is okay for the occasional scan job, but not for doing a lot of pages. Many copiers today can scan-to-digital. So, I concur on the idea of sorting the originals, then scanning to pdf. In fact, you could then make bookmarks in the pdf at every dated separation and in that way have access to the entire thing by date in one document.

If this is something like a novel project you have been writing notes on over the last ten years, then maybe I could offer two pieces of unsolicited advice: 1) ordering by date may make the most sense for the reclamation of your notes, but you should not take it for granted that it is. Make sure your impulse to order by date is not merely driven by an historical impulse – something which memorializes your years of work, rather than puts it into its most useable form. This is a genuine impulse people have. If your indiv docs are modular enough and don’t skim around too much in topic, then sorting by plotpoint, etc, might be more sensible. And even if sorting by date is the right thing, once you reclaim what you need you should get away from that kind of ordering as fast as possible. 2) Don’t get tied down to your old notes. The only novel that matters is the one that is in your head and that you would write NOW.

gr

Hugh · February 21, 2017, 10:01am

I’d like to echo this. I used to write TV scripts for a living. It was very easy to be wedded - or even metaphorically welded - to notes and other background documentary evidence. But they could become merely a confidence crutch. In retrospect the best scripts I ever wrote were those where the notes, having been well-read before writing began, lay locked in a drawer - available for fact-checking and for ensuring that all the good turns of phrase had been used, but later. (The human brain is a wonderful sifter and shaper of thoughts, when allowed to do its work unencumbered.)

SmallFish · February 22, 2017, 10:03am

Thank you so very much! These are great ideas, and you have helped me to significantly change my previous thinking. I deeply appreciate the responses, especially in such a short time.

I see that I could have included a bit more detail about my project to make my concerns more clear. The final product is going to be a biography, hence the need for chronology. Also, the reason I am hoping to convert all the notes into electronic format is that I am dreadfully allergic to dust, hence the paper needs to be eliminated from my life. Finally, the entire file conversion process needs to happen within a month, since I am moving and can’t take the boxes with me. Do you think it will be possible in that time frame?

I very much like the idea offered up of sorting the paper first, rather than doing it after scanning. Although this will take up a lot of space in a small work room, I agree that it will be much easier than fumbling with mouse and cut and paste and…

I also like the idea of using a high speed scanner rather than ScannerPro. And the idea of bookmarking dates! Thank you! I am investigating if there is such a machine available nearby that has an automatic feeder that can handle pages of many sizes, some front-to-back but not all.

Any additional ideas or suggestions?

Thanks again!

Hugh · February 22, 2017, 10:55am

The best automatic sheet-fed scanner designed for consumers rather than professionals is generally thought to be the Fujitsu Scansnap ix500. It is relatively fast and comes with some excellent software. If the software is set up properly, its images are precise enough for OCR purposes and certainly good enough for reading on-screen. Its biggest weakness is that it cannot handle any sheet larger than A4 (or US Letter size). The Fujitsu Scansnap series of scanners also includes smaller, less expensive, more portable models, but using them will take you longer to complete the job and could prove a false economy. To deal with larger sheets you could of course cut them up to fit the scanner (that’s what I do) or use a non-sheet fed so-called “document scanner” such as the Fujitsu SV600, but using that will also add considerably to the time your scanning project will take.

If you have one month for this project, with the ix500 the time taken for the scanning itself is unlikely to be an issue. As I say, the ix500 is fast. Ten seconds or less is the time it takes to scan both sides of each sheet of mine. What will take up time - I have found - are the preparation, including the aforementioned cutting up for size, and the work afterwards, including renaming the files, OCR-ing them for search purposes (see my post above) - if that is what you wish to do - I think the iX500 comes with OCR software - and filing them so that they are easily retrievable on your computer. Without knowing the number of your handwritten sheets it’s impossible to say how long the entire project would take, but the ix500 would probably give you the best chance of completion within the month deadline if you decide to do the work at home. I have no experience of contracting scanning out.

SmallFish · February 23, 2017, 11:11am

Holy Moly, another project changer!

I have investigated several scanning options, here is what I found in case anyone is interested (assumes 50,000 sheets of paper):
– local copy shop wants $.60/page. PER PAGE! Do they hand feed??? Estimated job outlays: $30,000, two weeks
– nearby professional data center identified, awaiting their response.
– purchase of Fujitsu ScanSnap ix500 is way less expensive than what I had expected, at under $500. Scans 50 ppm, can adapt to Mac OS, can do OCR conversion (of course at a slower rate, approximately 30 ppm), can handle multiple document sizes in one batch dumped into the automatic feeder, can continue transferring data into the same data file from batch to batch (i.e. continuous pdf file creation for multiple batches), can even auto upload into Evernote and DropBox and other services directly without using a computer as intermediary. Estimated job outlay: $500, 17 hours of non-stop crunching if I calculated correctly.
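
For anyone who wants to rerun these estimates with their own numbers, here is a rough Python sketch of the arithmetic. It uses only the figures quoted above and assumes one sheet counts as one page at the quoted speeds; double-sided sheets, jams and batch reloading are ignored.

```python
SHEETS = 50_000          # sheet count assumed in this post
COPY_SHOP_RATE = 0.60    # dollars per page quoted by the copy shop
SCAN_PPM = 50            # pages per minute quoted for plain scanning
OCR_PPM = 30             # approximate pages per minute with OCR switched on

def hours_to_scan(sheets, pages_per_minute):
    """Hours of non-stop scanning at a constant speed."""
    return sheets / pages_per_minute / 60

print(f"Copy shop:  ${SHEETS * COPY_SHOP_RATE:,.0f}")                # $30,000
print(f"Scan only:  {hours_to_scan(SHEETS, SCAN_PPM):.1f} hours")    # 16.7 hours
print(f"Scan + OCR: {hours_to_scan(SHEETS, OCR_PPM):.1f} hours")     # 27.8 hours
```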

Guess what option is winning so far?

THANK YOU THANK YOU.
Donna.

Hugh · February 23, 2017, 3:08pm

You are very welcome, Donna.

By the way, don’t forget to leave yourself time to learn how to use the ix500 and the software that comes with it (if that’s the option you choose). Both are relatively straightforward, but you will need to understand the software that manages the device and the OCR programme before you begin. Also, give yourself time during your “crunching” for a breather - and your machine. You, and your machine, will deserve it (especially if, as you imply, you have over 6,000 sheets to scan)!

And, of course, be sure to let us know when your book is completed.

reepicheep · February 24, 2017, 6:48pm

Some comments:
• Going to depend on how much you value your own time. Assuming the minimum wage of $7.25, spending two weeks doing nothing else but run a scanner is going to “cost” at least $116. You’d need to add that to the cost of your scanner.
• Do the scanner and OCR software you’ve selected cope with handwritten documents? Usual estimate has been that OCR scans are 99% successful on typewritten material, which for an 80 character line means that you can expect at least 1 mistake every two lines that you will need time to correct. If the combination of scanner/software cannot cope with OCRing handwriting very well, if at all, then additional time will be required for “corrections”.
• You will need to add in more hours to correct those OCR errors, which can be quite amusing as this report from the Guardian newspaper in the UK points out.
• Unlikely that your chosen domestic scanner can run continuously scanning 50,000 pages. Will it even run continuously for the eight hours of a notional working day?
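
The two back-of-envelope figures above can be checked with a short Python sketch. It assumes errors fall independently per character and uses 16 hours as an illustrative labour figure, since that is what $116 works out to at $7.25 an hour; neither assumption comes from a measured run.

```python
def expected_errors_per_line(accuracy, chars_per_line=80):
    """Expected number of misread characters on one line of text."""
    return (1 - accuracy) * chars_per_line

def labor_cost(hours, hourly_wage=7.25):
    """Notional cost of your own time at a given hourly wage."""
    return hours * hourly_wage

per_line = expected_errors_per_line(0.99)
print(f"{per_line:.1f} errors per line, {2 * per_line:.1f} per two lines")  # 0.8 and 1.6
print(f"${labor_cost(16):.2f} for 16 hours")                                # $116.00
```

At 99% accuracy that is well over one mistake every two lines, and handwriting is likely to fare worse than typescript.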

kewms · February 24, 2017, 7:13pm

I would suggest posting a similar question in the DevonThink forums. There are some people over there who work with large quantities of historical documents and might have some good advice.

You might also want to play with the DevonThink Pro trial a bit. I love Scrivener, but it’s not really designed to manage 50,000 pages of PDF files.

Katherine

gr · February 25, 2017, 1:39pm

I don’t think the workflow here requires storing and managing the pdfs inside Scrivener. Presumably, each resulting pdf will cover a span of time, and the collection of them is easily managed in a folder in the Finder. Suitable titling will make sort-by-title also put them in date span order.
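
To illustrate what "suitable titling" could mean, here is a small Python sketch comparing two naming schemes. The file names are made up; the assumption is simply that each PDF is titled with the date span it covers.

```python
# Hypothetical titles: year-first names versus month-first names.
iso_titles = [
    "2010-01_to_2010-06.pdf",
    "2008-07_to_2008-12.pdf",
    "2009-01_to_2009-06.pdf",
]
month_first_titles = ["01-2010.pdf", "07-2008.pdf", "01-2009.pdf"]

# Sorting by title, as the Finder would when sorting by name.
print(sorted(iso_titles))
# ['2008-07_to_2008-12.pdf', '2009-01_to_2009-06.pdf', '2010-01_to_2010-06.pdf']
print(sorted(month_first_titles))
# ['01-2009.pdf', '01-2010.pdf', '07-2008.pdf']
```

Year-first titles come out in date order; month-first titles do not, because the sort compares the month digits before it ever reaches the year.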

It is also not clear the goodness of OCR results matters that much. OP just wants scans of the documents themselves for archiving and dust-free /reviewing/. I imagine a process more like: reading through the resulting pdfs in Preview (maybe doing some markup), typing notes alongside in a floating window from Scrivener. Being able to pull an occasional chunk of OCRed text would just be a happy bonus, even if it needed cleaning up.

Remember, the OP is looking for a straightforward path that is not too techy.

SmallFish · February 27, 2017, 10:07am

I am so grateful to everyone who responded to my initial inquiry. Thank you!

After reviewing the information you have provided, and thinking long and hard about my project and dust-free needs, I have decided to buy a high-speed scanner (most likely the Fujitsu mentioned, it seems wonderful especially for my requirement of piling in documents of different sizes in the same batch feed, and will also allow a move towards realising a long-standing dream of becoming paper-free). I will first organise the paper documents chronologically. While scanning I will not use OCR, and if I find a passage that I want to use later on while writing, I will just type it.

Thank you also for the ideas on work in Scrivener after the archiving is done.

You have significantly changed my thinking about this project for the better, and gotten me out of quite a stuck rut! Thank you again.

Hugh · March 2, 2017, 9:09am

Just as a postscript: if anybody reading this thread does contemplate using voice-to-text rather than scanning for digitising material prior to - or during - writing, the very latest version of Nuance’s Dragon for Mac, version 6.0.6, is by all accounts (and in my own brief experience) a big improvement on what had preceded it. Not perfect, but less prone to crashes and errors which for me had made it virtually unusable for long-form use.

SmallFish · March 8, 2017, 2:29pm

Quick Update:

I am downloading the software for my shiny new ScanSnap iX500 just now, and came across instructions on how to scan sheets larger than A4 size on this page:

file:///Applications/ScanSnap%20Manual.localized/iX500ScanSnapManual.app/Contents/Resources/manual/basic/EN/iX500/scan_a3cs_larger_a4.html

Maybe that will save someone else some time.

Regards,
SmallFish.

hittjw · March 9, 2017, 2:28pm

I had this exact same problem. Volumes of hand written notes, journals, and materials from over the years. What worked for me is slightly different than what was covered in this thread. I produce non-fiction training materials and presentations.

My materials represent transcriptions of speeches, training sessions, and reports of findings from more than 20 years of consulting. There were also magazine and marketing clippings, hand written notes, hand drawn diagrams, and pages from work journals.

1. Group original documents by project, rather than date. For me it was binders and folders bringing the physical documents together into outlines. If a document needed to go in two folders, then I scanned it and put a place holder in my binder. At this point some materials were thrown away, combined, or sent to transcription if better suited as an article.
2. Scan everything for the project I’m working on with a ScanSnap S510M I got off Craigslist. My assistant scanned some of the documents on an Epson multifunction printer. Both produced reasonably good documents. I let Adobe Distiller do the OCR, which it does well. For the journals, they were sent to a scanning house, or indexed by hand to scan only necessary pages.
3. Code scanned documents into Scrivener projects. Kind of like a previous writer mentioned, except on a per project basis. I used Paper Tiger to code unscanned materials into file cabinets. With everything per project, I could pull up just what I’m working on. If PDFs are saved on Google Drive, I can (kind of) search them in Paper Tiger too.
4. Before turning a scanned document into a chapter or section in Scrivener, I’d write a brief outline of what I’m looking to extract. In hindsight many of these documents are not perfect matches for the end project. Much of the older material needed refreshing or was duplicated by newer materials.
5. Open all my sources full screen on both monitors, print out the outline (or have it in Scrivener’s note panel) then dictate. With Dragon Naturally Speaking I could walk through the material at the speed of sound to only pull what was relevant from original materials. This saved me loads of editing.
6. Have the original scanned documents and coding available to editors to reference. When doing revisions in Scrivener, or on a sync’d directory of RTF files, being able to double check claims, concepts, and clarity items (against originals) is a huge help. I’m almost at the point where an assistant could dictate the extraction from my outline, then someone else check it in the edit.
7. At the end of each writing session I export a manuscript proof, note where I left off, and update tasks related to that project in ASANA. I’m the only one in Scrivener, everyone else edits from a shared Dropbox folder that includes sync’d copies. The PDF exported manuscripts are often shared with clients for review.

What I’d like to be able to do next is embed the PDF like images into the Scrivener project. Then my manuscript proofs could contain hand written materials in development. That would give me a more usable output for editors and clients to work with … rather than looking up a document from a place holder.

I definitely recommend the off loading. My kid helps sort and index; a local assistant helps with filing, organizing, scanning, and typing; a virtual secretarial group handles typing; outside scanning groups do bigger packages and odd shaped stacks. However, since everything is grouped by project any new materials are quickly incorporated – and often there is a check waiting for finished documents.

When everything was sorted by year and month rather than topic or project, I couldn’t find anything that made a good output. The per project idea came from my mom who has several humorous short story books. She would stack up photocopies of her stories in piles representing a book, then sit down to type it all up.

With this method she could turn out a book in 30-days. If she had better luck dealing with publishers she would be famous by now. Turns out she has the same name as a famous “paranormal romance novelist” which confused publishers. She also got worked over by a few self publishing companies, and doesn’t want to write under a pen-name for some reason.

Fortunately her other talents of painting, teaching, and crafting are satisfying in retirement.

If anyone knows how to embed PDF as content in Scrivener (incorporating it into a compile) then please let me know. The only way I have found so far is to convert the PDF into images – but that was laborious so place holders are used instead, i.e. [2013_05_06_08_37_36.pdf]

Best,

Justin
Publisher, AdBriefings Copywriting Tips

ScriverTid · March 11, 2017, 10:20am

Ah. No. Only the scans need to be done within the month. Those scans are then electronic files which can live on your computer ready for subsequent conversion stages.

Or, if you take the voice-to-text option, that’s the only part you need to do inside a month.

Another idea is to “farm out” those boxes to someone who can do the scans for you, hopefully fairly cheaply. In that case - provided you agree beforehand on things like file formats - you don’t need to worry about the month’s deadline (unless you’re moving to a different continent!)

devinganger · March 13, 2017, 9:52am

One might assume that some people may need to refer to the original documents to resolve any questions that may arise from scan irregularities…