Compile Choking on Large File

pseingalt · April 3, 2022, 1:54pm

New Project. 12 mb Scrivener file; pretty much all text.

Tried to compile to pdf. No problem. 1500 pages.
Tried to compile to .docx. No problem.
Tried to compile to ePub via Pandoc.
Scrivener choked.

Any suggestions for compiling to ePub?

After I got this message I transferred the file to another computer, this one with 16 gb memory instead of 8 mb. Tried to compile just one chapter: no problem. Tried to compile the whole thing: the beach ball keeps spinning. So it’s not:

memory
storage
Pandoc.

Any thoughts? I took a look at:

And followed the advice there. As noted above, a single chapter compiles to ePub through Pandoc without difficulty.

I suppose I could compile to html but the feature I like about the Pandoc option (thank you
Amber!) is that part, chapter and article subheadings are automatically picked up.

One option that I thought of would be to compile to text and then do a global search and replace for

“Article”

and replace it with

“### Article”

and build the ePub through Markdown. The issue that way is that the text often contains the capitalized word “Article” which is not an Article marker (i.e., \sub\subsection in LaTeX).

I was surprised that Scrivener couldn’t compile the file. I might let it run overnight to see what happens, but…maybe there’s another way.

AmberV · April 3, 2022, 3:08pm

I’d break up the process into two steps instead of one, to see where things might be going wrong. It might not even be Scrivener that isn’t compiling it (since it seems to be doing so otherwise), but Pandoc that is failing.

For best results, double-click on your compile Format in the left sidebar to edit it, if it is custom (otherwise skip this part).
1. In the format designer window, select “MMD” from the dropdown above the sidebar. If you don’t see it, click the gear button to the right and add it to the list of eligible types.
2. Click Save
Okay, you should now see Compile for: “MultiMarkdown” at the top, and your normal compile format setup still selected—basically almost everything identical to what you’re normally compiling. If you’re still using the stock “Basic Pandoc” compile format, then like I say you can skip the above, as it is set up to work with MMD and all available Pandoc types.
Compile to an .md file. If this step fails, we know Scrivener is the problem and you can skip the rest of this checklist.
All right, so if you have an .md file, it’s good to know that at this point you have exactly the same material Scrivener produces after the first phase of creating an ebook with Pandoc.

In your message there seems to be confusion about this fact—where you speak of “Markdown” being a separate type of compile that would require a whole different workflow, fixing headings and so forth. That wouldn’t make any sense. When you compile an .epub with Pandoc Scrivener creates an .md file just like this one, and then does what we’re going to do next (with one caveat I’ll mention shortly), by hand. If you have broken headings in your source material that you’re having to constantly fix one way or another, then you might as well fix it at the source of the problem and make them Markdown-friendly—but that’s another thing.
Switch to Finder, and use the Go ▸ Utilities menu command. Open Terminal.
Paste in the following command, after fixing the ‘yourname’ part to your user account folder in a TextEdit window or something:
```
pandoc --log=/Users/yourname/Desktop/log.txt -t epub -o ~/Desktop/test.epub 
```
Leave a space at the end of that command, and drag and drop your compiled .md file into the Terminal window, to have it paste its full path. Hit Return.
That may take a few minutes given the size of the input file, but see if that works.

If you get an .epub file that way, be warned it won’t be “complete”. Scrivener’s compile settings create several helper files, such as one for your metadata, cover page and any CSS you are using, and sets up the pandoc command line for you. So it won’t look “right”, but if it works that could be a clue.

If it does not work, open the ‘log.txt’ file you’ll find on your Desktop and see if there is anything promising in there. If you just see an empty pair of brackets, it unfortunately had nothing to say about why it failed.

pseingalt · April 3, 2022, 6:25pm

What I mean is that the files in this project are pure text. So, other than the use of the words “Part,” “Chapter” and “Article,” there is nothing to identify these as either Scrivener headings or Markdown #, ## or ###.

I’ll try to compile to Multimarkdown this evening and will try step two tomorrow. The Scrivener project file did Compile to Word and pdf. On another note, is it possible to specify paper size and margins for pdf compiles? Or is one limited to A4 or letter paper?

Thank you.

AmberV · April 3, 2022, 6:30pm

What I mean is that the files in this project are pure text. So, other than the use of the words “Part,” “Chapter” and “Article,” there is nothing to identify these as either Scrivener headings or Markdown #, ## or ###.

How are you getting a ToC if that is the case? It should be fairly easy to fix by using a Section Layout that adds a heading. With any Markdown-based output the compiler adds the hashes for you. Well, once you try it and open the .md file itself you may see better if that is happening or not. It is true though that in most Markdown-based projects, in the writing area, you wouldn’t see any heading markers unless you yourself type them in.

On another note, is it possible to specify paper size and margins for pdf compiles? Or is one limited to A4 or letter paper?

Some compile formats have a paper size baked in, and you would need to edit the Format’s Page Settings to modify it, but most of them will just pull from the project’s print settings, via File ▸ Page Setup....

pseingalt · April 3, 2022, 6:38pm

This is why I was amazed on the last project when a very detailed ToC appeared without marking the headings in the editor or adding hashmarks. In fact, I had to remove the hashmarks because they showed up in the ToC. This happened when using the Pandoc=>ePub option.

In that project I didn’t worry about a pdf ToC because there was already a LaTeX version. There wasn’t a version from which I could derive an ePub though.

I don’t know how a ToC could be generated (aside from “Copy as ToC” in the binder, which I haven’t used.

Part of the problem is not knowing what’s going on in the black box, which is why fixing issues is difficult.

I would not want to go through the entire project and identify headings: the project compiles to ~1500 pages in Word. That’s why I thought a global find and replace adding hashmarks would be a way towards fixing it, but not a fix itself. That is because the hashmarks would be added both where they are supposed to be:

Chapter 3 Terms

becomes

## Chapter 3 Terms

but,

In this Chapter, terms are discussed

becomes

In this ## Chapter, terms are discussed

where it’s not helpful.

I would have to hire someone to see if awk is smart enough to understand the difference.

AmberV · April 3, 2022, 6:49pm

That’s why I say using a section layout that adds a Markdown heading would be the easiest way to fix it. I suppose some might go to the trouble of marking all of their headings themselves, but that would not be very flexible. If you move an item up a level you’d have to change all of the heading levels—just like you’d have to do if you used styled “Heading 1”, “Heading 2” etc., typed-in headings in a word processor-based workflow. Neither way is terribly efficient in comparison to letting the compiler build out the heading structure for you. So that’s just basic Scrivener stuff really, not even specific to Markdown.

There are always exceptions of course, sometimes the physical binder outline might not match what you need for heading structure, but by and large with Markdown it’s a lot “easier” in the sense that you just tell the whole outline (or up to a certain level of depth) to have headings and that’s all you need to do. Scrivener adds “#” or “##” or “###” for you, depending on outline structure.

Part of the problem is not knowing what’s going on in the black box, which is why fixing issues is difficult.

I guess the first hint would be knowing there is no black box. In fact with a Pandoc-based workflow it’s just about as close to the opposite as you can get! I suppose it might seem that way if you have no idea how any of it works, but Pandoc is fully flexible down to the smallest detail—and as demonstrated above, Scrivener’s role of producing and plain and simple Markdown file that is then used by these other tools to create end file types gives you a sometimes valuable “interrupt” in the compile process that few of the other compile methods offer. You can’t stop PDF in the middle of the process and examine the precursor that ends up being turned into a PDF. It’s all or nothing—that’s much closer to a “black box”.

So often the best way of learning a new system is to start slow, and experiment. Rather than going straight into a 1,500 page epic where every experimentation is a battle against a wall of what already exists, just make a little blank project and type in a few one-liner documents. Give them names. Tell the compiler to insert headings based on those names, and see what you get. Once you get a feel for how this can work, it should make applying that theory to a large established body of text easier.

pseingalt · April 3, 2022, 6:56pm

How do you do that? How does Scrivener know that this text belongs to a different text than that text when both are unmarked?

I wouldn’t dare! Unfortunately, the whole purpose of this project is to create a collection of laws that are unindexed and in some cases unidentified, to make searching for the relevant regulation easier. Ten years ago the file was less than 1 mb in size, now it’s up to 12 mb. This is because, I suppose, lawmakers are constantly playing whack a mole.

I’ll try to experiment with shorter texts.

pseingalt · April 3, 2022, 7:02pm

I don’t understand how to achieve that. There’s a synthetic outline in the binder, but other than the existence of different parts in the binder, the chapters and articles are plain text in each of the files.

So in the binder there’s:

Book 1
Book 2
Book 3

The contents of Book 1 is a text file–no italics or mark-up of any kind. Articles are set off by two carriage returns from the previous article. Chapters and Parts are set off by three carriage returns. Divisions lower than Article are set off by a single carriage return. But it’s all text.

AmberV · April 3, 2022, 7:12pm

It’s really no different than how you would add a heading to parts of the outline, using Scrivener in any other fashion. Perhaps that is not something you have ever done before, and I assumed incorrectly:

For a simple example, say we’re using a Blank project to start with and have added three items called “Book 1” through “Book 3”, open Compile.
Select plain “MultiMarkdown” at the top, so you can more directly see what Scrivener is doing rather than the conversion engine after the fact.
Select the “Basic Pandoc” compile format in the left sidebar.
Click Assign Section Layouts, and for this basic example, select all of the types in the left sidebar (click the button to show unused as well) and assign them all to “Text Section with Heading”.

See how the preview tile has a Markdown-style heading depicted in it, as an example?
Compile that. You should get, assuming you didn’t add any text:

# Book 1 #

# Book 2 #

# Book 3 #

Now toss a text item beneath “Book 1” and call it “Chapter 1”, maybe type in some lorem ipsum text into the editor, and compile again. Now you get:

# Book 1 #

## Chapter 1 ##

Whik gronk; thung epp rintax whik jince dwint srung sernag nix la quolt sernag brul jince. Twock, quolt whik tharn dri cree gen... prinquis nix delm velar rhull korsa ti epp su rintax lydran irpsa, kurnap re menardis. Ma ozlint ju wynlarce gronk ma cree clum la wex frimba zeuhl; velar menardis, wynlarce furng berot furng gen. Thung er wynlarce wex tolaspa, srung morvit galph. Gen athran morvit... korsa, morvit menardis kurnap rintax velar teng srung vo frimba. Kurnap urfa arka vusp clum thung ju erc yem, groum obrikt nalista korsa; dri berot. Groum galph; ik, morvit ti gronk zeuhl erc nix. Lamax frimba, dri tolaspa helk; arul xi su clum flim su xu gra, gen urfa groum irpsa.

# Book 2 #

# Book 3 #

If you were to compile that to Pandoc → DOCX, open it in Word and rebuild the ToC, you’d get a four-point heading table of contents. With ePub you get the outline structure in your Contents section.

pseingalt · April 3, 2022, 8:41pm

I wish I could. Instead, I have a single file of text, anywhere from 5000-60,000 words long in “Book.”

AmberV · April 3, 2022, 9:03pm

That is why I advocate for testing with a simple sample project, again so that one isn’t battling against constraints. When I said toss a text item in the mix, I meant press ⌘N in your test project and type in a few words into the text editor.

As for the actual project, it doesn’t really sound like it is set up to work optimally with Scrivener at the moment, in general, so I suppose all of this is a bit academic until it is split up into an outline more conducive to how the compiler works. Matter of fact, such a large corpus of text in a single chunk like that could explain the stability issues too.

Just out of curiosity, why are you using Scrivener at this point, with this project? It sounds to me like you would get along more efficiently just converting this to ePub either from a Markdown document or using Calibre from some other format. Scrivener is adding a layer of complexity that seems to be entirely unnecessary if you’ve got whole books, already written, in each binder item.

pseingalt · April 3, 2022, 9:30pm

I haven’t found any program which lets you assemble different texts into a single file. I don’t know of any scrapbook-like programs that would accomplish this. In this case, though for me it’s not unusual, the agency in question has published 32 separate laws. These are contained in 32 separate files. All of these are pdf’s, some image, some text. The problem is finding relevant legal provisions. There is a lot of overlap among these 32 files, so that you might find provisions relating to conflicts of interest in five or six of them–or more. Creating a single file makes it possible to search across all of the files at once.

LaTeX is an option, I suppose. After OCR’ing the image files it wouldn’t be that difficult to format the resulting text and bring it into a simple LaTeX file using \includepdf. But as you know, LaTeX works well for print but doesn’t work well for digital publications. A 1500 page file would be unwieldy, and once you split up the files you lose the ability to search through all 32 files at once. Another way would be to convert each pdf into an html file and post them on a static web page. Search of all of the files could be accomplished using Google’s site search parameter. (“site:32files.com”)

So, Scrivener Is what I use to work on the 32 separate files before I compile them into a single file.

I have also tried to simply combine the pdf files into a single file. The result is well over 100 mb, and this is just unworkable for searching a single pdf file. Bringing those files into Scrivener and then compiling them to Word results in a very manageable file size of about 1.5 mb. Modern Word files are compressed, so the text file might be bigger.

Anyway, for scrapbook-like projects like this, I haven’t found anything that compares with Scrivener. It’s too bad it hasn’t become a standard in the legal community, because then I could simply share the project file and I wouldn’t have to worry about compile.

There’s a small commercial side to this as well but that’s just lagniappe and in the greater scheme of things, not that important.

AmberV · April 3, 2022, 9:56pm

Ah okay, thanks for sharing a bit of the workflow. It definitely sounds like a challenge having to wrangle so much of the data out of PDFs and such. There’s no good easy route from there to ePub.

LaTeX is an option, I suppose. After OCR’ing the image files it wouldn’t be that difficult to format the resulting text and bring it into a simple LaTeX file using \includepdf. But as you know, LaTeX works well for print but doesn’t work well for digital publications.

Yeah, that quandary (in skeletal form to be clear, ebooks were still on the horizon back then) is why I gravitated toward a simpler “nexus markup” approach, from writing purely in LaTeX in the late '90s and early '00s. When I came across MultiMarkdown in '05 or so, a tool capable of making .tex files, and .html and .rtf all from one common central format that was easy to write with, I knew I’d found the implementation I’d been dreaming of, and never looked back. I felt one of the main advantages of writing that way would be that while the source never changes, the quality and breadth of the output options would over time get better. That’s certainly proven to be the case. The same exact document I wrote back in 2005 can produce a much higher quality .docx file than it ever did as .rtf back then, and I can now generate dozens of different file types from it, like .epub, .odt and so on.

So it has withstood the test of being a personal “archival” format. Something that won’t suffer the ravages of software updates breaking old formats, fonts going missing, TeX-Live package updates breaking documents, standards updating and so on.

But a write-outward ideal like that is decidedly different than having to implode huge amounts of pre-formatted material into a new format. I don’t know what the best answer is for that, glad to hear to Scrivener is helping out though.

pseingalt · April 3, 2022, 9:56pm

I can also put the un-OCR’ed pdf’s into the binder under research, then correct possible errors (or things that don’t seem right) using Scrivener’s split screen feature, I also used this feature to add a chapter (as opposed to project) ToC by copying Article titles I had created while editing and then copying the headings to the beginning of that particular file. There are other projects that have split-screen functionality, but I’ve always found Word’s to be unwieldy. Yes, I can open two different files at the same time as well, but Scrivener…just works.

drmajorbob · April 3, 2022, 10:30pm

That helps us understand why you don’t want to (or cannot) split the documents.

pseingalt · April 4, 2022, 6:23am

A similar use case would be if someone wanted to make a single volume out of several Gutenberg Project non-fiction texts and were to use Scrivener to stitch them together.

The compile to Multimarkdown worked without a problem. So I’ll try the CLI commands later to Pandoc and see what happens.

At this point I might try just compiling a few files at a time and see what happens. I will either add hashmarks via a ctrl+F (will take a while) or try to find someone to write an awk script that will distinguish between hashmarks at the beginning of the line and anywhere else. Then I would have a proper formed Markdown file with Parts, Chapters and Articles.

AmberV · April 4, 2022, 10:07am

A similar use case would be if someone wanted to make a single volume out of several Gutenberg Project non-fiction texts and were to use Scrivener to stitch them together.

I suppose, but I would not use Scrivener for that job, it’s quite unsuitable and overkill, if that’s all you’re looking to get out of it. Scrivener does many great things, but stitching together files is not unique, nor is it efficient at doing it. The operating system itself, along with core system tools, is capable of concatenating files together at a level of optimisation that no software will be able to approach:

cat book1.md >> master_file.md
cat book2.md >> master_file.md (etc.)

Now what the compiler is great at doing is fundamentally and even radically changing the contents of “book1.md” on the fly, like inserting deep heading document structure based on an imposed binder outline structure, or injecting syntax around objects such as image sources, etc. If you aren’t actually using any of that though and are just using it as a stitching tool, it’s not great.

That said, if we’re talking about Markdown, phsyical stitching (even to a temp file) is often not going to be necessary anyway, since both Pandoc and MultiMarkdown can take multiple .md files as input, and stitch them together into a single output file.

pandoc -t epub -o multi_volume_book.epub current-project-*.md

The asterisk here means include all .md files starting with ‘current-project-’. So we get all of ‘current-project-01.md’ to ‘current-project-15.md’ used to create the .epub.

While Pandoc doesn’t support it directly, MultiMarkdown also supports what it refers to as “transclusion”, which is identical in theory to the \input command in LaTeX (or the <$input> placeholder in Scrivener for that matter), which as you know is a great way to keep files shorter than mammoth, and as well to keep the scope of the current .tex file relevant rather than branching off into configuration details or what have you.

At this point I might try just compiling a few files at a time and see what happens…

But you said it compiled a complete .md file successfully right? There isn’t anything much further to learn from Scrivener (there is one thing, but we’ll get to that). The next step in the diagnostic chain is to see if Pandoc can convert this .md file to .epub in one go. How that test goes determines what to test next.

I will either add hashmarks via a ctrl+F (will take a while) or try to find someone to write an awk script that will distinguish between hashmarks at the beginning of the line and anywhere else.

That is not a complex regular expression if that is all you need.

^Chapter

You will want to have real headings in your books at some point, so that wouldn’t be wasted time. What you’re producing right now may likely crash ebook readers or make them run very slow, anyway. It may also be the source of the problem, but we don’t know that yet, so I didn’t want to suggest fixing it until other easier to solve issues were ruled out.

In short though, the internal .xhtml files of an .epub are designed to be roughly chapter-length. From what you suggest in your description here, you’ve got whole “books” in these. Small ones, to be fair, but still well beyond what anyone would reasonably call a “chapter” for some of them.

pseingalt · April 4, 2022, 4:34pm

What does the caret in front of “Chapter” mean?

AmberV · April 4, 2022, 5:41pm

It means to start the search pattern at the very beginning of the line, so this simple string will only find cases where a capitalised word “Chapter” starts the line. You could make it safer, or more complete with additional parameters though. For instance:

Replace

^(Chapter \d+)$

With

## $1 ##

This one ensures that what follows the word chapter is a space and then some amount of digits, which must terminate the line (the dollar sign is the opposite of the caret, so both together means the pattern must describe the entire line to be valid). If by chance you need letters instead of digits (Chapter Twelve), then replace \d with \w.

To replace that correctly we need to capture the result though, since the digits will be variable. Before we could just replace with “## Chapter”. The parentheses do the capturing, and the $1 prints it.

P.S. If you try using Project Replace for this instead of regular document Find, be aware of a bug that causes the tool to hang with regular expressions, when searching in empty values. Just disable all the checkboxes but “Text” and maybe “Title” and you should be fine.

drmajorbob · April 4, 2022, 6:03pm

Is it a Scrivener bug or a 3rd party tool bug?