What does MMD Import actually require of a text file?

I am trying to diagnose cases where feeding a plain text document to Scrivener’s MMD-import results in the error message “The file is not a valid multimarkdown file. Nothing was imported.”

Can someone tell me what the operative conditions are, i.e. what conditions the routine is testing for? I have figured out a few of the gotchas, but it has become clear that there is at least one more!


BACKGROUND: I am working with a workflow like the following: ordinary text is placed in a Word document, a script is run to massage the text and prepare it for MMD import, and the file is saved as Text Only. The result is then imported into Scrivener using the MMD import function.

The case at hand is one where the document is just plain text from a Project Gutenberg document, and the script has done little but insert some # marks at the beginnings of certain lines. The script has also performed the following get-ready-for-MMD-import functions: i) remove any headers, footers, footnotes, endnotes, and comments; ii) crush all CRLF and CR characters to simple LFs. In an earlier script these moves had seemed to turn the trick, but now I have cases where they are evidently not enough.

HISTORY: I had (or thought I had) worked through this problem with an earlier script. There I learned that one needed to strip out all headers, footers, footnotes, and comments, but most importantly, to force all CRLF and CR characters to simple linefeeds. [By the way, the need to crush all CRLF and CR is a real bummer, because i) Word does not have an option to save to .txt with just LF line breaks, and ii) it is impossible to accomplish in any way except programmatically (e.g., it cannot be done through Word search & replace). So, if there is any way the importer could avoid requiring this, that would certainly be a delight.]
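For what it is worth, the crush-to-LF step my script performs can be sketched in Python (operating on the exported bytes rather than inside Word; the sample string is hypothetical):

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Crush CRLF and lone CR line breaks to simple LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Word's "Text Only" export uses CR (classic Mac) or CRLF line breaks:
sample = b"# CHAPTER I\r\nSome text.\rMore text.\n"
print(normalize_line_endings(sample))  # b'# CHAPTER I\nSome text.\nMore text.\n'
```

The order matters: replacing CRLF first ensures a CRLF pair is not turned into two LFs.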

Here is a small file which evidences the sort of trouble I am having.

This two-paragraph file was produced in the following way: I took a file of about 80,000 words which was giving me the import problem. Using a binary search procedure, I arrived by process of elimination at a pair of paragraphs that occur near (but not at) the end of the document.

The elimination procedure made clear that most of the 80,000 word original is unproblematic for import, so the import routine is balking at something very local within these paragraphs.

The attached file is a text file (Unix LF style) produced by TextWrangler. The Scriv-MM import fails on it. Both of the paragraphs generate the import failure when taken singly as well.

Any clues on this particular case?

Unless the MMD import routine has a schlock filter which these lines from Anna Katherine Green are triggering, I cannot see what the issue could possibly be.

If paragraphs like these can precipitate an import failure, you can see why I am anxious to see if I can get info (if feasible) on the general parameters a file must meet to make it through the import process!

Thanks for anything and everything.


Scriv 1.11. Tiger 10.4.11. Powerbook G4. Brain 0.34b
mmd-import-problem-case.txt (1.36 KB)

It’s probably just a minor bug in Scrivener. There has to be at least one # section heading # in an MMD file.
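If the importer really does require at least one # heading, a pre-flight check could be sketched as follows (the exact rule Scrivener applies is a guess; this just looks for any ATX-style heading line):

```python
def has_mmd_heading(text: str) -> bool:
    """Return True if any line starts with '#', i.e. an ATX-style heading."""
    return any(line.startswith("#") for line in text.splitlines())

print(has_mmd_heading("# CHAPTER I\n\nSome body text.\n"))  # True
print(has_mmd_heading("Just plain paragraphs.\n"))          # False
```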

That is very helpful. Unfortunately, it is only my sample file which could have run afoul of this one-# requirement; there must be at least one other hitch.

We can get the sample file I posted to import if we insert some # headers, but my original (large) source file (which I chopped down to the sample file) contained many # headers but would not import.

So, what this shows is that my troubleshooting (binary) search procedure needs to be redone with this at-least-one-# requirement in mind, to uncover what else is going on. I will have another go at it and post again.

Thanks for the insight!


The improved search procedure led me to a line, two paragraphs away from the text of my earlier sample, which contained an occurrence of the term:


So, there was one high-bit ASCII character in the book text, and that is what Scriv’s import routine appears to be choking on.

Right now that means that I will have to add the following step to my workflow: open my script-prepared text file in TextWrangler and tell it to crush any high-bit characters to low-bit ones. Something of a bother.

Maybe the import routine just presupposes all low-bit ASCII at some point and throws an exception when it hits my é. A more permissive MMD import would seem a virtue, so I will drop a note to the wish list (bug report?).
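For anyone else hunting a stray high-bit character, here is a sketch of a locator in Python; it reports 1-based line and column positions of every byte outside 7-bit ASCII (the sample text is made up):

```python
def find_high_bit_bytes(data: bytes):
    """Yield (line, column, byte_value) for every byte above 0x7F.

    Assumes the file already uses plain LF line breaks.
    """
    line, col = 1, 1
    for b in data:
        if b > 0x7F:
            yield line, col, b
        if b == 0x0A:  # LF starts a new line
            line, col = line + 1, 1
        else:
            col += 1

sample = "caf\u00e9 noir\n".encode("mac_roman")
print(list(find_high_bit_bytes(sample)))  # [(1, 4, 142)]; 142 is 0x8E, Mac Roman é
```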

Thanks again for the help, which in the end helped me get a lock on the real problem.


Actually it imports as UTF8 so I’m not sure why this would cause a snag. The necessity for a hash thing is known and is on the list to address for the next update (hopefully).

Ah, it is expecting UTF-8. You probably alerted me to this the last time I was troubleshooting an MMD import script. My text files were in Western Roman (Mac) with Unix (LF) line breaks.

The trouble for me is that MS Word does not know how to save text files with either of these features.

I worked around this last time by having the script force LF line breaks throughout the Word document before save time. And this worked for MMD import for all the documents I have processed with that earlier script. But, as you will see, this process was giving me Unix LF line breaks with Western Roman (Mac) encoding, which works fine with the Scriv import until you hit a high-bit character.

So, this gives me a slightly improved workaround (i.e., opening the result in TextWrangler and resaving the text document with UTF-8 encoding).
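That resave step can also be scripted, which would avoid the manual TextWrangler round-trip. A minimal sketch, assuming the input bytes really are Mac Roman (the sample word is hypothetical):

```python
def mac_roman_to_utf8(data: bytes) -> bytes:
    """Re-encode Western Roman (Mac) bytes as UTF-8."""
    return data.decode("mac_roman").encode("utf-8")

# 0x8E is é in Mac Roman; it becomes the two bytes 0xC3 0xA9 in UTF-8.
print(mac_roman_to_utf8(b"fianc\x8ee"))  # b'fianc\xc3\xa9e'
```

Note that plain-ASCII bytes pass through unchanged, which is why Western/LF/no-é files look indistinguishable from UTF-8 files.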


P.S. For what it is worth, my rational mind tells me it should be impossible that the Scriv routine fails on my Western/LF/é but succeeds on Western/LF/no-é documents (as it does). Here is why:

A hex dump of the file contents (in TextWrangler) of a Western/LF/é and a UTF-8/LF/é file suggests that between the two the exact same encoding of the text content is used, including for things like é, at least as far as my test file goes. Scriv rejects the former and accepts the latter.

Now, evidently Scriv is not reacting to them differently because the actual text encoding is different; it isn’t.

And we know Scriv is not reacting differently to them just because it sees the one is not a UTF8 file (i.e., just on principle). We know this because it happily imports Western/LF/no-é files with no trouble.

This makes it quite mysterious (to me) why the one will import and the other not.

Now maybe i) TextWrangler is misleading me about the content encoding, or ii) the existence of high-bit characters in the text changes something in the header/resource area of the file (so that Scriv expects something in the non-content area that it does not find there in my files, but only when there are such characters in the text body). But I wouldn’t have thought either of these possibilities very likely.
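Possibility (i) is at least easy to test directly: in Mac Roman, é is the single byte 0x8E, while in UTF-8 it is the two-byte sequence 0xC3 0xA9, so two files that both truly contain é should not be byte-identical across the two encodings. A quick check:

```python
# Compare how the same é character is encoded under each hypothesis.
e_acute = "\u00e9"
print(e_acute.encode("mac_roman"))  # b'\x8e'
print(e_acute.encode("utf-8"))      # b'\xc3\xa9'
```

If the hex dumps of the two files really do show identical bytes for é, then one of the files is not in the encoding its label claims.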

Try opening the file in TextEdit, using File > Open and selecting UTF-8 as the encoding, and see what happens there. I believe it is possible to batch-convert the encoding of text files, though I don’t recall how; Amber will be able to fill you in if she sees this.

Thanks, Keith. Yes, using TextEdit or TextWrangler I can definitely put the UTF-8 blessing on my text files.


P.S. At the moment it does not appear that one can put this step under AppleScript control.