Trouble with multifile pandoc export on large projects

jandavid · April 17, 2019, 8:36pm

I frequently manage to make Scrivener freeze when exporting very large projects via pandoc to multiple files using the otherwise very nice ruby splitter script from here:
https://literatureandlatte.com/web/forum/viewtopic.php?f=2&t=52114
and I did specify the encoding as suggested here:
https://forum.literatureandlatte.com/t/postprocessing-script-encoding-problem/43973/1

I’ve tried a number of things in order to isolate the problem, and I’ve come to the conclusion that it’s indeed simply the fact that my project is too large, and that pandoc can’t handle it.
It’s a book-length project with 10 chapters and many many sections and subsections.
The splitter script is exactly as provided above with the only difference that instead of MMD

  # `multimarkdown -t latex -o #{filename} #{tmpfile.path}`

I am using pandoc, so the last line looks like this:

  `pandoc -f markdown-auto_identifiers -t latex --biblatex --top-level-division=chapter --columns=120 -o #{filename} #{tmpfile.path}`

That has been working fine and is still working fine as long as the number of chapters/sections to export is limited.

But my project has grown very large. Up to three or four chapters it exports fine. But if I dare to add more chapters Scrivener freezes and I get the spinning beachball (I’m on macOS) and have to force quit.
I’ve also experimented with selecting all 10 chapters but only one section plus subsections, and that works without problem.

So it really looks like it’s simply that with too much data it chokes.

Also, the markdown file is produced fine and then the first two, sometimes three tex files pop up but then at some point it freezes and doesn’t finish.

My question then, any suggestions what I might try to tweak to make it work again?

nontroppo · April 18, 2019, 12:33am

What happens if you run the splitter script from the terminal manually rather than via Scrivener’s compile? Run from the terminal, you should be able to easily redirect any errors or other issues to a log file for examination; and you should run Pandoc with the --verbose flag to get as much info as possible. If you can get a reproducible hang with a specific file, this can be reported to the Pandoc issue tracker…

I doubt the problem is “too much data”; each split section should trigger a separate Pandoc process (though I can’t access the splitter script thread and can’t remember exactly what it does)…

jandavid · April 18, 2019, 7:08am

OK, I’ve tried that: I saved the script as an .rb file and then ran ruby myscript.rb input, where “input” is the markdown file produced by Scrivener.

I get a number of “Note with key ‘cf63’ defined at line 378 column 1 but not used.” warnings, for all the footnotes, and a warning “Could not load include file ‘filename.tex’ at line 1 column 29” (I do have some literal latex code in there including raw code blocks to later \input external files). But I don’t get any output at all? I’m probably doing something wrong here. Would I have to modify the #{filename} #{tmpfile.path} arguments?

I also doubted it was the quantity of files. So, first I tried unselecting sections in Scrivener’s compile and then adding them back until it failed. I thought I had found the section that caused trouble to inspect it further. But then I compiled only that section and it worked fine. Then I added that section but deselected a number of other sections and again it compiled.
I don’t want to rule out that all of the sections I tried have some issues like some special characters or something and with a few issues it doesn’t choke but if there are too many it freezes … but so far I’ve been unsuccessful in isolating any specific problem.

BTW, this is the correct link to the original splitter script, for some reason the one posted before was broken:
https://forum.literatureandlatte.com/t/compiling-to-latex-one-file-per-section/41136/1

nontroppo · April 18, 2019, 10:26am

Try this splitter.rb — ever so slightly modified to try to catch all output and send it to the terminal for display:

#!/usr/bin/env ruby -wU
# encoding: utf-8

require 'tempfile'
require 'open3' # ruby standard library class to handle stderr and stdout

SPLIT_MARKER = '>>>> '
FILE_EXTENSION = 'tex'

# Split the input file by markers, writing the contents into temp files. Access temp file with file_chunk['name_of_file.txt']
file_chunks = {}
metadata_block = nil
input = ARGF.readlines
input.join.split(SPLIT_MARKER).each do |chunk|
	# Store the first chunk as MMD metadata, to be added to each temp file
	unless metadata_block
		metadata_block = chunk
		next
	end
	next if chunk.length < 1

	filename, *lines = chunk.split("\n")
	next if filename[-4..-1] != ".#{FILE_EXTENSION}"
	tf = Tempfile.new(filename)
	# Add metadata to top of each temp file; unless we're in the reference list
	tf.print metadata_block unless filename == "references.#{FILE_EXTENSION}"
	tf.print lines.join("\n")
	file_chunks[filename] = tf
end

# Pull out the references.txt temp file. We will append it to the bottom of each document that we process. It contains figure and footnote references. Pandoc will ignore any that do not apply to the section, so this can be done blindly.
references = file_chunks.delete("references.#{FILE_EXTENSION}")
file_chunks.each_pair do |filename, tmpfile|
	references.rewind
	tmpfile.print references.readlines.join("\n")
	tmpfile.close
	cmd = "pandoc --verbose -t latex -o #{filename} #{tmpfile.path} 2>&1"
	puts "\n:: Running: #{cmd} ::\n"
	Open3.popen2e(cmd) do |_stdin, oe, thread|
		while (line = oe.gets)
			puts ':::: ' + line.chomp
		end
		exit_status = thread.value
		puts ":: exit status: #{exit_status.to_s} \n"
		unless exit_status.success?
			puts "!!!---RETURNED non-zero value---!!!"
		end
	end
end

Something weird is definitely going on…
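
For anyone who cannot open the original thread, here is a minimal sketch of just the splitting step, separated from the pandoc call so it can be tried on a small string. The helper name split_sections and the sample text are made up for illustration; only the '>>>> ' marker and the convention that the first line of each chunk is the target filename come from the script above.

```ruby
SPLIT_MARKER = '>>>> '

# Hypothetical helper: returns the metadata block and a hash of
# filename => body for every chunk whose first line ends in the extension.
def split_sections(text, extension = 'tex')
  metadata, *chunks = text.split(SPLIT_MARKER)
  sections = {}
  chunks.each do |chunk|
    filename, *lines = chunk.split("\n")
    # Skip empty chunks and chunks that do not name a file of the wanted type.
    next unless filename && filename.end_with?(".#{extension}")
    sections[filename] = lines.join("\n")
  end
  [metadata, sections]
end

# Made-up sample input: a metadata block followed by three marked chunks.
sample = "Title: Demo\n\n" \
         ">>>> one.tex\nFirst body\n" \
         ">>>> notes.txt\nignored\n" \
         ">>>> two.tex\nSecond body\n"

metadata, sections = split_sections(sample)
sections.each { |name, body| puts "#{name}: #{body}" }
```

Each key of the returned hash corresponds to one pandoc run in the full script, so the number of splits is simply the number of accepted chunks.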

jandavid · April 18, 2019, 6:37pm

I really appreciate your time helping me troubleshoot!
So, this is what I get with your script (from the command line):

:: Running: pandoc --verbose -t latex -o Chapter_1.tex /var/folders/11/yyzl685d0h11s7ctdldn02pc0000gn/T/Chapter_1.tex20190418-71057-1mwjc6y 2>&1 ::
:::: [WARNING] Could not load include file ‘file.tex’ at line 1 column 35
:::: [WARNING] Note with key ‘cf11’ defined at line 195 column 1 but not used.
:::: [WARNING] Note with key ‘cf12’ defined at line 199 column 1 but not used.

… (a bunch of those warnings)

:::: [INFO] Not rendering RawBlock (Format “html”) “”
:: exit status: pid 71061 exit 0

:: Running: pandoc --verbose -t latex -o Chapter_2.tex /var/folders/11/yyzl685d0h11s7ctdldn02pc0000gn/T/Chapter_2.tex20190418-71057-1clbufi 2>&1 ::
:::: [WARNING] Note with key ‘cf1’ defined at line 238 column 1 but not used.
:::: [WARNING] Note with key 'cf10’ defined at line 274 column 1 but not used.
…
:: exit status: pid 71063 exit 0

… and so on for all the 5 chapters that were included.

Also, one more observation:
Just playing around with it further, I have recreated a Scrivener document with only two chapters (plus subdocuments) which compile fine.
I have then duplicated these chapters 5 times.
It freezes when compiling.
Then I moved the duplicated chapters to make them all subsections of the second chapter.
It compiles!

This made me think that it does somehow have to do with the quantity of chapters (i.e., splits).

However, and here comes the weird part:
I have then moved them back out to chapter level.
Deselected everything in the compile window except for the 7 (top-level) chapters and one subsection.
Again it compiles fine.

I’m at a total loss …

jandavid · April 18, 2019, 6:48pm

Further info:

I think it’s reproducible. (I’d be curious if others could test this on different machines.)

(1) Download Ioa’s nice splitter script from https://forum.literatureandlatte.com/t/compiling-to-latex-one-file-per-section/41136/1
(2) Duplicate the two sections in each, the “red book” and the “black book” nine times so that each book has ten sections.
(3) Then duplicate red and black book so that you get ten top level folders (with ten subsections each)
(4) Compile, click Pandoc syntax, and in the script change the multimarkdown line to
pandoc -f markdown-auto_identifiers -t latex --biblatex --top-level-division=chapter --columns=120 -o #{filename} #{tmpfile.path}
(5) Hit compile. It will freeze and you’ll get the beachball … at least that’s what I experience on my system …

nontroppo · April 19, 2019, 12:24am

…and works fine on my system

I actually created 20 top-level folders not 10 to really push it, creating a 1.5MB markdown file. Ruby takes 8.1 seconds to split this into 20 73kb .tex files. Here is the project: www50.zippyshare.com/v/VZdGEyQ4/file.html — note it will create a splitter.log so you can check the output.

I normally use the latest Ruby (V2.6.2), but by using rbenv, I could also easily test the system Ruby (V2.3.7), and it also works. I’m using Pandoc V2.7.2

jandavid · April 19, 2019, 7:22pm

So it seems I’m finally making some headway …

Using the modified script from the 20-chapter scrivener file works on any of my projects no matter how big. So now I’ve tried to figure out one by one which of the settings screwed it up.

The one thing that seems to have caused the problem was the lack of 2>&1 in the arguments field.
So even with the original splitter script from the other thread, it always works as long as I say 2>&1 no matter on what file (either from some of my own projects or the original project with the 10 black and red books).
But as soon as I omit 2>&1 it fails. I can’t claim I understand why, but that seems to be the issue … or at least one issue.

I say “seems” because the 20-chapter scrivener file from your link above does work no matter what, even if I leave the arguments field completely blank. (I thought it’d be unnecessary because according to the manual the input file is provided anyway and the output is generated through the script, so in my compile format I kept it blank … and provided the file wasn’t too big it worked fine.)

So what does 2>&1 do? I googled, and it seems to have to do with redirecting error messages to stdout? (But I don’t quite get how that could have an impact on whether or not it fails.) BTW, the .log file remains empty on the successful runs, and if it fails it doesn’t even get to the point where it writes it.

Also, you did add a number of other things to the script, maybe to make it more robust; may I ask what they do? Specifically, require 'open3' at the beginning of the script, which seems related, and putting the pandoc command into cmd = "pandoc …" and the additional stuff at the end? The original script still works without that, as long as I specify 2>&1, but I’m still curious.

In any case, does any of this make sense to you? Why do I run into trouble otherwise?

nontroppo · April 21, 2019, 2:08pm

Hm, that makes no sense to me; as you found out 2>&1 simply redirects any messages sent to stderr to the standard output, it shouldn’t affect the process itself. I’m also surprised your log remains empty, redirection should absolutely create output. It suggests something is amiss with your shell environment. Can you reproduce this in the terminal?

> ./splitter.rb splitter-test.md 2>&1 >> splitter.log

vs

> ./splitter.rb splitter-test.md

???

I use open3 as it is considered a more robust way to run processes in Ruby, ensuring you can catch any problems and print them out.
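
As a rough illustration of that point, the sketch below uses Open3.capture2e, which hands back stdout and stderr as one merged string (the Ruby-side counterpart of 2>&1) together with the exit status. The child command is a hypothetical stand-in for pandoc: a one-line Ruby script that prints to both streams and exits with status 3, so the example runs without pandoc installed. The helper name run_merged is also made up.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical stand-in for pandoc: one line on stdout, one on stderr, exit 3.
CHILD = [RbConfig.ruby, '-e', '$stdout.puts "out"; $stderr.puts "warn"; exit 3'].freeze

# Runs the command without a shell, returning the merged stdout+stderr text
# and the numeric exit status.
def run_merged(argv)
  output, status = Open3.capture2e(*argv)
  [output, status.exitstatus]
end

merged, code = run_merged(CHILD)
puts merged
puts "exit status: #{code}"
```

Because both streams are read by the calling script, warnings end up wherever the script decides to print them, and a non-zero exit status can be reported instead of being silently dropped.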

jandavid · April 22, 2019, 12:28pm

I do think it’s something in the original script that created trouble, not necessarily my system (although I’m in no position to judge). Here’s another set of attempts at figuring out the problem.

First, as you requested:

./splitter.rb splitter-test.md 2>&1 >> splitter.log

gives me

./splitter.rb:41:in `block in <main>': undefined method `rewind' for nil:NilClass (NoMethodError)
	from ./splitter.rb:40:in `each_pair'
	from ./splitter.rb:40:in `<main>'

and

./splitter.rb splitter-test.md

gives me

=== ------------------------------------------------------ ===
=== Splitter V1.0 Report @ 2019-04-22 10:11:31 +0200 ===
=== ------------------------------------------------------ ===
Working directory: /Users/me/Documents/Academic/Writing/troubleshooting
Initiating with Ruby 2.3.7
Pandoc: /usr/local/bin/pandoc | V: 2.7.2
./splitter.rb:41:in `block in <main>': undefined method `rewind' for nil:NilClass (NoMethodError)
	from ./splitter.rb:40:in `each_pair'
	from ./splitter.rb:40:in `<main>'

That seems fine, right?

I’ve then tried to systematically test all variables (and combinations thereof) again, first with the original splitter scrivener project downloaded from the other thread, then with the one you’ve provided above.

In the original project, top-level folders (called Red and Black book) have been duplicated to produce a total of 20 separate files.
I have left everything in that script as provided except for the last line (and when using pandoc also checking the “Pandoc syntax” checkbox) and the arguments field in the Scrivener dialog.

First, I’ve experimented with the arguments field left blank (as it was in the original).
This is what I get for each of the commands (the last line of the script):

(a)

  `multimarkdown -t latex -o #{filename} #{tmpfile.path}`

works without problem (produces all the 20 .tex files).

(b)

  cmd = "pandoc --verbose -t latex -o #{filename} #{tmpfile.path} 2>&1"

works but does not produce any .tex file.

(c)

  cmd = "pandoc --verbose -t latex -o #{filename} #{tmpfile.path}"

works but does not produce any .tex file.

(d)

  `pandoc --verbose -t latex -o #{filename} #{tmpfile.path}`

freezes (spinball).

Then I have added <$inputfile> 2>&1 >> splitter.log into the arguments field.

(e)

  `multimarkdown -t latex -o #{filename} #{tmpfile.path}`

works without problem, produces files but log remains empty.

(f)

  cmd = "pandoc --verbose -t latex -o #{filename} #{tmpfile.path} 2>&1"

works but does not produce any .tex file, log remains empty.

(g)

  cmd = "pandoc --verbose -t latex -o #{filename} #{tmpfile.path}"

works but does not produce any .tex file, log remains empty.

(h)

  `pandoc --verbose -t latex -o #{filename} #{tmpfile.path}`

works without problem, produces files (!) but log still remains empty

Now moving on to your Scrivener project:

(i) as provided it works fine (obviously). The log file reads:

=== ------------------------------------------------------ ===
=== Splitter V1.0 Report @ 2019-04-22 11:10:54 +0200 ===
=== ------------------------------------------------------ ===
Working directory: /Users/me/Documents/Academic/Writing/troubleshooting/orig script/nontroppo_orig
Initiating with Ruby 2.3.7
Pandoc: /usr/local/bin/pandoc | V: 2.7.2

:: Running: pandoc --verbose -f markdown-auto_identifiers --biblatex --top-level-division=chapter --columns=120 -t latex -o 1-RedBook.tex /var/folders/11/yyzl685d0h11s7ctdldn02pc0000gn/T/1-RedBook.tex20190422-51634-1j9wcqj ::
:::: [WARNING] Note with key ‘cf100’ defined at line 1092 column 1 but not used.
… etc.

(j) same setup but with nothing in the arguments field.
All works fine (but no .log file)

(k) put <$inputfile> 2>&1 >> splitter.log back into the arguments field and changed the line
cmd = "pandoc …" to:

  `pandoc --verbose -t latex -o #{filename} #{tmpfile.path}`

I get an error message "The file could not be created: There was a problem generating the file using my-script-2"
and also a separate Error log:

[WARNING] Note with key 'cf100' defined at line 1084 column 1 but not used.
... a bunch of those warnings ...
[INFO] Not rendering RawBlock (Format "html") ""
/var/folders/11/yyzl685d0h11s7ctdldn02pc0000gn/T/my-script-2:46:in `block in <main>': undefined local variable or method `cmd' for main:Object (NameError)
	from /var/folders/11/yyzl685d0h11s7ctdldn02pc0000gn/T/my-script-2:40:in `each_pair'
	from /var/folders/11/yyzl685d0h11s7ctdldn02pc0000gn/T/my-script-2:40:in `<main>'

This produced the very first of the .tex files, but no other files. And here I do get the expected .log file with the content:

=== ------------------------------------------------------ ===
=== Splitter V1.0 Report @ 2019-04-22 11:06:20 +0200 ===
=== ------------------------------------------------------ ===
Working directory: /Users/me/Documents/Academic/Writing/troubleshooting/orig script/nontroppo_pp
Initiating with Ruby 2.3.7
Pandoc: /usr/local/bin/pandoc | V: 2.7.2

(l) same setup but with nothing in the arguments field.
Same result without .log file.

So, long story short. Your version of the script does work (which is what I’m now using).
However, I’m still curious as to what the original problem was. That script does work if error reports are redirected as in (h) above…

nontroppo · April 23, 2019, 2:23pm

The very first sample running from the command line is creating an error in Ruby (NoMethodError), that is not fine. The references block is empty (nil in Ruby) causing that error. That seems like a markdown file problem.
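
One way to make that failure easier to read is to check for the missing references chunk before calling rewind on it. This is only a sketch: the helper name references_text is made up, StringIO objects stand in for the Tempfile objects the real script stores, and continuing with an empty references block (rather than aborting) is an assumption about what is wanted.

```ruby
require 'stringio'

# Hypothetical helper: removes the references chunk from the hash and returns
# its text, or an empty string (with a warning on stderr) when it is missing.
def references_text(file_chunks, extension = 'tex')
  references = file_chunks.delete("references.#{extension}")
  if references.nil?
    warn "No references.#{extension} chunk found; continuing without references."
    return ''
  end
  references.rewind
  references.read
end

# StringIO stands in for the Tempfile objects used by the real script.
with_refs = {
  'a.tex' => StringIO.new('body'),
  'references.tex' => StringIO.new("[^cf1]: note\n")
}
without_refs = { 'a.tex' => StringIO.new('body') }

puts references_text(with_refs).inspect
puts references_text(without_refs).inspect
```

With a guard like this, a markdown file that lacks the references section produces a one-line warning instead of a NoMethodError backtrace.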

Whenever you run with the backticks (``), as in your (e) or (h), the log file will not be generated (that is expected, I think, since backticks capture the command’s output into a Ruby string that the script never prints). And just having cmd = "…" will not work without the Open3.popen2e(cmd) do |_stdin, oe, thread| … end block that actually uses the cmd.

My takeaway is that Ruby’s backtick syntax (`pandoc ...`) causes the hang (it doesn’t affect multimarkdown for some reason, which is not so surprising as it is a much simpler system), and using Ruby’s popen2e to run the command works (which is what is recommended for running commands via Ruby anyway). Glad it’s fixed anyway!
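
For reference, a small sketch of what backticks do and do not capture. The child command is again a hypothetical stand-in for pandoc (a one-line Ruby script), not the real tool. One possible reading of cases (d) and (h), which is an assumption and was not confirmed in this thread, is that pandoc's many warnings go to a stderr stream that nothing reads when the script is launched from Scrivener, whereas multimarkdown prints no such warnings.

```ruby
require 'rbconfig'
require 'shellwords'

# Hypothetical stand-in for pandoc: one line on stdout, one warning on stderr.
child = [RbConfig.ruby, '-e', '$stdout.puts "out"; $stderr.puts "warn"'].shelljoin

# Backticks capture stdout only; the warning goes to the parent's own stderr.
plain = `#{child}`

# With 2>&1 the shell folds stderr into stdout, so backticks capture both.
merged = `#{child} 2>&1`

puts "plain:  #{plain.inspect}"
puts "merged: #{merged.inspect}"
```

Run from a terminal, the first call still shows "warn" on screen because the terminal is the parent's stderr; only the second call brings it into the captured string.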

jandavid · April 26, 2019, 7:06am

Awesome, thanks so much for helping me figure this out!

ptram · July 8, 2024, 9:54am

Out of curiosity: could the name of the split file include hyphens? Something like “next-section.qmd”? The script is stripping out everything, including these markers, that would be useful for readability of the file name.

Also: maybe all uppercase characters can be converted to lowercase?

Paolo
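
A possible starting point for that, as a sketch only: a helper that turns a section title into a lowercase, hyphenated file name. The helper name section_filename and the default 'qmd' extension are made up for illustration, and how the title reaches the script depends on the compile format, which is not shown in this thread. The pattern is ASCII-only, so accented letters would also be replaced by hyphens.

```ruby
# Hypothetical helper: lowercases the title, turns every run of characters
# other than a-z and 0-9 into a single hyphen, and trims hyphens at the ends.
def section_filename(title, extension = 'qmd')
  stem = title.downcase
              .gsub(/[^a-z0-9]+/, '-')
              .gsub(/\A-+|-+\z/, '')
  "#{stem}.#{extension}"
end

puts section_filename('Next Section')
puts section_filename('  Chapter 2: The Red Book  ')
```

The result could then be used as the key under which each chunk is stored, in place of the raw first line of the chunk.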