Split pdf files?

Cadence · October 19, 2010, 10:34pm

I’m trialing DevonThink Pro Office. I have a few gigs of pdf documents, many of them quite long. All OCR’d. I had DTP index my pdfs instead of importing them, I performed some searches, and…came to realize that to get the max out of DTP, I need the software to work with smaller documents – like pages and paragraphs.
So…is there a way to split looooong pdf documents into pages?

~A currently depressed Cady.

Eddy · October 19, 2010, 10:56pm

For small files (<20 MB) there is an online site that purports to do it at splitpdf.net/

Also I found this with a quick google: mac.softpedia.com/get/Utilities/ … erge.shtml - haven’t checked it out but looks like it might work for you

A more commercial offering can be found here: smilesoftware.com/PDFpen/index.html - it’s a licence cost, but if you are a heavy user of PDFs I can see it being a worthwhile investment if it solves your problems.

I’m sure there are more.

Eddy

Cadence · October 19, 2010, 11:14pm

Hmmm. Thanks. Looked at both. PDFpen looked interesting. Maybe the “Automate PDF manipulations with AppleScript” thingy mentioned in the Features section might allow automatically splitting documents. I wonder if DevonThink could auto-split documents?

Jaysen · October 19, 2010, 11:21pm

can you copy/paste the text in the pdf?

Cadence · October 19, 2010, 11:42pm

Yes, I can copy-paste. I just tested some of my documents to make sure.

Jaysen · October 19, 2010, 11:42pm

give me a bit.

Jaysen · October 20, 2010, 12:11am

Take a look at pdflabs.com/docs/pdftk-cli-examples/

It is pretty easy to use and can do what you are looking for. The method I was going to use gets too ugly, and if I am going to grow a new head I would like one that looks good, not like AppleScript spaghetti.

Cadence · October 20, 2010, 12:22am

Thanks Jaysen!
Gonna try out pdflabs before purchasing anything.

BTW, I’m now trying to imagine your pic with a spaghetti-head attached. LOL!!!

Jaysen · October 20, 2010, 12:30am

You should see if one of the fam will snap a photo when we go to an Italian place. Trust me it is not pretty.

The tool I recommended is free. And you can use Automator to make it simpler. If I were in a less grumpy mood (like I should be tomorrow) it might be possible to talk me into whipping something together for you.

Cadence · October 20, 2010, 12:52am

Yup I saw the tool’s free. I meant “before purchasing anything like PDFpen” LOL.

Yup, tell 'em to snap that photo. I’m thinking…fusilli:

Hey, I’ll work on it. I’ve had a Mac since June, about time I get to know Automator.

Jaysen · October 20, 2010, 1:05am

While many see Automator and AppleScript and Perl as a “rite of passage”, I tend to think of them more like the mechanic thinks of the auto repair books in the library: an accident waiting to happen.

Just make sure that you are very certain about the directories that you are working in. I would also make sure you have good backups if you are going to be doing disk/file IO. Things can go wonky if you are a little too confident (as I pointed out to a guy when I made him spend the entire day restoring the servers he destroyed from tape).

Have fun. Let us know if you need help.

xiamenese · October 20, 2010, 1:15am

I use PDFPen Pro. Very useful for doing things like splitting PDFs, filling in PDF forms … it also connects direct to my scanner for scanning and has very impressive OCR — though I would guess DTP uses the same OCR engine. My only problem with it is that the OCR doesn’t yet include Chinese, though the developers tell me that is on the list for inclusion … though no doubt it will come at a significant price!

On the other hand … I don’t have DTP and given what I say above, if you’ve laid out your hard-earned spondulicks on that, I’m not sure that it would be worth paying $50 for PDFPen or $100 for PDFPen Pro on top of that, if there is free software that will do the splitting job.

Mark

Cadence · October 20, 2010, 6:20am

Thanks! I intend to have fun. Fun is my middle name (actually, I don’t have a middle name, but I always wanted to say that. I’m still waiting for the right moment to say, “It’s quiet in here. Almost too quiet”).
Thanks for the backup warning. I always back up with SuperDuper and Time Machine.

Hey Mark, Thank you for your OCR engine comment, made me re-check things. I just discovered DTP uses “ABBYY FineReader engine”. When I wanted to add Hebrew as a second language it didn’t have that option -???
I checked the PDFpen website and found it uses a different engine: OmniPage OCR engine.

xiamenese · October 20, 2010, 2:19pm

Interesting … does PDFPen (Pro) do Hebrew? At least that is basically alphabetical in its way … it doesn’t have a minimum of 14,400 characters to deal with … over 28,000 if you want both Simplified and Traditional Chinese.
Mark

Cadence · October 22, 2010, 4:24pm

Hey Mark, I haven’t tried PDFpen yet, so I don’t know if it handles Hebrew.
But I just got a reply from Devon Technologies: at the moment the ABBYY OCR engine they embed does not handle Hebrew. I was advised to consider purchasing a full-blown OCR package such as ABBYY FineReader or ReadIRIS including the Hebrew option if sold separately.
LOL. 14,400 characters huh?
Hebrew has only 22 letters (5 of which take a different form when appearing at the end of a word). Again → LOL!!!

Jaysen · October 22, 2010, 5:14pm

interesting thought that this brings to mind.

Mr X feel free to call me names if I am wrong, but I believe that the Chinese characters are actually words. So it really isn’t 14000 characters, but 14000 symbols that are combined to express concepts. Yet the majority of the time these symbols are used on their own to represent a complete concept.

Compare this to the complex horrors thrust on those who use alpha style symbologies. 26 letters making fairly random combinations to express ideas.

To me it would seem easier to just deal with 14000 pictures than with ∞ combinations. Just look at my name.

Sean_Coffee · October 22, 2010, 6:00pm

This isn’t what you’re looking for, but it reminds me: The best thing about Curio – which is like a mindmapping, notecarding, whiteboard kind of thing – is that it allows you to split PDF pages among its individual workspaces. Here’s a video of that. It’s a neat little feature – kind of like pinning 90 pages of a script to 90 different whiteboards and making all kinds of different notes around the edges.

So, anyway, that’s a thing.

xiamenese · October 23, 2010, 2:06am

OK, you asked for it.
Chinese characters are morphemes, not words, a morpheme being the minimum linguistic unit that carries a meaning … phonemes, individual distinctive sounds /p/, /s/, /n/, /e/ … etc. don’t carry a meaning, though there are a few morphemes in English that consist of a single phoneme “a-” as in asymptotic, for instance … phonemically /eɪ/. Again, with extremely few exceptions, Chinese morphemes are monosyllabic; English morphemes are frequently multi-syllabic “anti-” for instance — there are many issues that you may be wanting to raise here, but I won’t digress.
Written Classical Chinese is monosyllabic at word level too, so ideograph/character = syllable = morpheme = word. But given that the local “Minnan” dialect spoken here, known to the UN and the world as “Hokkienese”, is virtually unchanged in relation to the Chinese spoken in the Tang Dynasty court in Chang’an — again, I won’t digress into how this belief has been arrived at — and is polysyllabic at word level, one can presume that spoken Chinese in earlier times was also polysyllabic. Modern Chinese dialects in all their forms are polysyllabic, as is all written Chinese. It is one of the myths about Chinese that it is monosyllabic.
Another myth about Chinese is that the characters are pictures. There are a very small number of the earliest ones which are descended directly from pictographs: “man” 人, “woman” 女 (originally a woman kneeling … yes, I know …), “child” 子 (originally a baby in swaddling clothes), “good” 好 (originally meant “love” and is a woman holding a child), and so on. But as civilisation developed, symbols to represent concepts rather than physical objects were needed, and various ways of achieving this were developed within the graphic system … result: ideographs, which is what the characters are. The result is that many of the 14,400 (the basic set developed for typewriters and used in computers), to a Western mind at least, are totally impenetrable if unknown; you cannot be sure how they are pronounced or what their area of meaning is precisely (the traditional Chinese classification of the natural world is often bizarre … crocodiles and octopodes are both kinds of fish, for instance), let alone their exact meaning. And since Chinese is monospaced, with no extra spacing between characters, you cannot be sure whether the character goes with the previous one(s) to form a word, stands alone as a word, or goes with the following one(s) to form a word. A Chinese native speaker can use their fluency in the spoken language and the context to infer a meaning and pronunciation. And just to rub it in, the Kangxi Dictionary, produced in the early 18th century, has over 46,000 characters.
In comparison, a productive system like a small alphabet working with a set of rules is utterly simple. For a language like Italian or Spanish, it is beautiful; if you know how the individual symbols — and there are only 20-odd — are pronounced, you know how every word is pronounced. English is another matter …
The thing about English and its spellings (a fact that Webster and AOL-aficionados seem to prefer to ignore … in the case of the latter, “ignore” in its original meaning!) is not that it is totally crazy, but that the writing system is actually partially morphemic, representing different morphemes which were pronounced differently in Middle English, the period when what we know as English, rather than Anglo-Saxon, began to become a written language for serious purposes. So “knight” is different from “night” because in the former the “k” was pronounced, though no longer.
Now you, especially on the other side of the pond, might say that “knight” is a totally obsolete term, so why not change the other to “nite”, simply representing the pronunciation of the word, à la Italienne … well you can do it, but you lose all connection with the history of the language and its literature. Imagine Shakespeare written in text-spelling and such revisions! <ugh!> OK … I’m a luddite on this.
But the Chinese problem is far more severe … people, foreigners mostly, try to argue that Chinese should use romanisation and become alphabetical in writing. Apart from the question of which romanisation, and there are several, Standard Chinese (Mandarin) only has about 1,700 different syllables, even taking into consideration its four tones. A text written entirely in romanisation would be almost undecipherable as it would lose all representation of concept which is provided by the ideographs.
And you are irked by the spelling of your name as foisted on you by your parents … have a thought for some of my students. Chinese parents choose names for their children based on many criteria and scan the dictionary for characters which meet those criteria. So spare a thought for people like one of my students, who each time she writes her name has to write 魏晓慧! Hers is not the most complex I’ve seen … merely the most complex among my current students … Oh, and the middle character is “simplified”!
Sorry folks for the long off-topicking, but Jaysen, that master of diversion, asked for it.

Mark

Jaysen · October 23, 2010, 4:43pm

I humbly beg pardon for my mistaken understanding of Chinese. The few examples I have to work from are in your “rare few” set (my limited exposure outside of romanized examples for learning).

So much to learn. So little head in which to keep it.

Cadence · October 23, 2010, 8:34pm

Wow! Didn’t know Curio could do that. I don’t own Curio (yet…), but have just downloaded Curio 7. It looks like my dream-super-whiteboard. Now I could put my ancient Wacom tablet to use again. Ancient, as in: strawberry tablet model ET-0405-U…

Meanwhile I did manage to split my pdfs. I installed the pdflabs tool that Jaysen recommended, then dared to open Terminal, cd to the pdf folder, and type the command:
pdftk mydoc.pdf burst
Worked perfectly. I batch-renamed the files with a free app I found called NameChanger, and imported them to DTP.
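For anyone with a whole folder of PDFs, the same burst step could probably be looped so that the manual rename is not needed. What follows is only a rough sketch, not something tested in this thread: the function names `page_pattern` and `burst_all` and the `pages` subfolder are made up for illustration, and it assumes pdftk's `output` option accepts a printf-style page pattern such as `%04d`, as the pdftk examples page describes for `burst`.

```shell
#!/bin/sh
# Build a per-page output pattern from a PDF's name.
# Example: "report.pdf" -> "report_p%04d.pdf"
# pdftk is expected to replace %04d with the zero-padded page number.
page_pattern() {
    base=$(basename "$1" .pdf)
    printf '%s_p%%04d.pdf\n' "$base"
}

# Hypothetical helper: burst every PDF in a folder into a "pages" subfolder.
# The originals are left untouched; only new single-page files are written.
burst_all() {
    dir=${1:-.}
    mkdir -p "$dir/pages"
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # glob matched nothing, skip
        pdftk "$pdf" burst output "$dir/pages/$(page_pattern "$pdf")"
    done
}
```

After sourcing the file, something like `burst_all ~/Documents/pdfs` should leave the originals alone and write the single pages next to them. As Jaysen says above, try it on a copy of the folder first and keep backups.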

Mark. Your post describing Chinese morphemes is mighty impressive. Sounds like at least one of your PhDs is in linguistics. And now I’m positively discouraged from ever trying to learn Chinese!