Request JSON export capability

AmberV · January 23, 2023, 5:14pm

Better XML output

As far as that goes, as noted above, Scrivener can execute a post-processing chain for you, essentially resulting in a direct method, as you put it. It’s one where you have ultimate control over the conversion rather than hoping that whatever assumptions we make are what you require. Refer to §24.21, Processing, in the user manual PDF for further information. So at the very least you may be able to automate what you’ve been doing by hand.

The trick though is that the compiler itself does not generate OPML natively, so you’d be needing to start from a different point to get to where you want to go with that method. It’s worth considering MMD → HTML. It’s a different DTD, but you’d at least be starting with the same tech, one with unfathomably more support than OPML—and more importantly the formatting would be a lot more expressive. OPML is essentially just a plain-text dump into attributes.

Using Markdown as a compile file type doesn’t necessarily mean having to start with Markdown either, in case you’re worried about that. The compiler can convert a fair amount of formatting itself, and with a little prompting on our end via styles and section layouts—well, we can do quite a lot. But if you don’t mind writing in Markdown, I think you’ll be pleased with what Scrivener offers along those lines—even to the point of maybe skipping some of the below and just building a Lua output for Pandoc that does all of this from an abstract source.

Native output

As noted by nontroppo above, Scrivener itself is quite capable of generating output as syntax. We put a lot of effort and design into that side of Scrivener, so that people could come along and build custom file type converters. That way we don’t have to spend vast amounts of time doing so ourselves, and we don’t end up cluttering the Compile for menu beyond reason. Our design target for this was valid XML, which is going to be enough for most other syntax rules as well.

Between having Pandoc around with its dozens of supported output file types, and Plain-Text with its DIY syntax generation capabilities, there is little you can’t do yourself, or orchestrate the doing of with utilities.

To circle back to the original request for one moment though: while you are absolutely right that supporting JSON is like supporting XML export these days, and that it would benefit developers and the like who use Scrivener, the main problem I see with that approach is what that even means beyond the basic structure. Would an outline chunk be one long string with \n-separated paragraphs, or would it be an array of paragraphs containing metadata about para style and such for each line? What should be used to mark up inline formatting within paragraphs? Or should it be more hardcore, like Pandoc’s JSON AST, which is essentially unusable for a human as each word is considered a discrete structural component? I don’t think there are going to be universal answers to those questions. JSON is a way of structuring data, not of defining the application of that data or how it should be implemented, and I’m not aware of there being any kind of common “Book JSON” specification.
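To make the first of those questions concrete, here is a rough sketch of the two shapes side by side. The key names ("title", "content", "type") and the sample text are made up for illustration and are not what Scrivener or the attached format emits; the point is only that both are perfectly valid JSON and yet a consumer would have to treat them completely differently.

```python
import json

# Two paragraphs of made-up sample text.
paragraphs = ["First paragraph.", "Second paragraph, set as a quote."]

# Shape A: the whole chunk is one string, paragraphs separated by "\n".
chunk_as_string = {
    "title": "Scene a",
    "content": "\n".join(paragraphs),
}

# Shape B: an array of paragraphs, each carrying its own style metadata.
chunk_as_array = {
    "title": "Scene a",
    "content": [
        {"type": "body", "content": paragraphs[0]},
        {"type": "blockquote", "content": paragraphs[1]},
    ],
}

# Both serialise without complaint; neither is more "correct" than the other.
print(json.dumps(chunk_as_string, indent=2))
print(json.dumps(chunk_as_array, indent=2))
```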

Ultimately I fear it would be a “good for one person” type result, either that or so laden with compile pane options that we might as well just stick with what we’ve got and leave all of the answers to these questions up to you. With the exception of a little },{" type glue in between things—what more could we be doing that doesn’t cross over into presumption?

And the glue is the easy part—so in a way we’re already fulfilling the feature request to the best level we can.

Practical example of a JSON compile format

json_output.tar.gz (68.7 KB)

To that end, here is a POC that in my testing gets us to 100% valid JSON output, and that contains useful markup information at a granularity of paragraphs. Of note:

I’ve employed a few workarounds to get around some Windows bugs that involve how Section Layout prefix/suffix strings are merged with paragraph style prefix/suffix strings, when adjacent.
Info for Windows users...
- I’ve had to add a newline after the last paragraph of each section. Without it, the final paragraph does not acquire the styled prefix/suffix it needs to wrap the paragraph in JSON structure. With it, the final paragraph becomes the last empty one, which is ignored anyway by the compiler, as it should be.
  
  The implication to be aware of though is that one would need to be more careful with how they format content in the text editor—to be mindful of how each line is going to be transformed.
- One that I couldn’t fix is that Windows doesn’t handle custom date-time formats very well (for the created/modified time stamps). The output is kind of garbage, so if one were to be using Windows I’d suggest just using one of the stock placeholders instead.
Paragraph styles that aren’t body text would need to be added to the format’s Styles pane, and given a similar prefix/suffix treatment to regular paragraphs. I’ve provided one example of that using Block Quote. You will note that in the output of the first subsection, “Scene a”, we get two paragraphs in the array with their type set to “blockquote”.
Inline styles are a conundrum. One could argue that paragraphs containing inline styles should be broken down into another array, with each particle of that paragraph denoted for its semantic intent. That may be a bit much for the compiler to handle though as it would require a logical revision to the container from a simple hash key associated with a string for content to an array of typed strings, meant to be glued together by the post-processor:
Sample code...
```
"content":
[
	{
		"chartype": "normal",
		"content": "Beginning of the paragraph "
	},
	{
		"chartype": "emphasis",
		"content": "some text stated emphatically"
	},
	{
		"chartype": "normal",
		"content": ", the rest of the paragraph..."
	}
]
```
We could maybe do that with Scrivener all by itself, but I bet it would really be easier to just mark up the paragraph string and post-process it somehow.
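For what it’s worth, the gluing step a post-processor would need for that array-of-typed-strings layout is small. The following is only a sketch: `glue_runs` and the `MARKDOWN_WRAPPERS` table are hypothetical names of my own, and mapping "emphasis" to Markdown asterisks is just one possible choice of target syntax.

```python
# Hypothetical mapping from each "chartype" to the text wrapped around it.
# Here the target is Markdown; an HTML table would use ("<em>", "</em>").
MARKDOWN_WRAPPERS = {
    "normal": ("", ""),
    "emphasis": ("*", "*"),
}


def glue_runs(runs):
    """Join a paragraph's typed runs back into a single marked-up string."""
    parts = []
    for run in runs:
        # Unknown chartypes fall back to plain text rather than failing.
        prefix, suffix = MARKDOWN_WRAPPERS.get(run["chartype"], ("", ""))
        parts.append(f"{prefix}{run['content']}{suffix}")
    return "".join(parts)


runs = [
    {"chartype": "normal", "content": "Beginning of the paragraph "},
    {"chartype": "emphasis", "content": "some text stated emphatically"},
    {"chartype": "normal", "content": ", the rest of the paragraph..."},
]

print(glue_runs(runs))
```

Swapping the wrapper table is all it would take to aim the same data at a different syntax, which is the main thing the array layout buys you over a pre-marked-up string.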
On that, one approach I would favour, even though it adds dependencies to the post-processing chain, is using Markdown to mark up inline formatting in the strings. The advantage is that with Pandoc you can take a common source string and go pretty much anywhere you want with it. But if you have a certain single-format processing requirement, like HTML, then I suppose it would be just as easy to bake that into the compile settings. You wouldn’t be stuck at least, since all of this markup is coming out of the compile settings one way or another.

Either way, this can largely be done with Styles, using the prefix and suffix values (see “Emphasis” in the Styles pane). But another compile format pane to be aware of is the Markup pane, available to TXT output alone. I’ve only provided one simple example of adding Markdown links to internal cross-references (that’s the trickiest one as we are using the <$linkID> placeholder in conjunction with its usage in the Section Layouts to establish the value of the id key for each section chunk).
This is all a pretty simple example, but you can see that by using Section Types, and purpose-built Layouts, you could create additional structural designs beyond the array-of-paragraphs example. Since the paragraph style is being applied via the Section Layout itself, each layout can use whatever paragraph style it needs to format paragraphs to its intent.
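If you want to check a compiled file for syntax validity and then pull paragraphs out by type, something along these lines should do. Treat it as a sketch under assumptions: the top-level array of sections and the "paragraphs", "type" and "content" keys in the sample are my guesses for illustration, so adjust them to whatever your compile format actually emits. `load_compiled` and `paragraphs_of_type` are hypothetical helper names.

```python
import json


def load_compiled(text):
    """Parse compiled output, reporting where it breaks if it is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        ) from err


def paragraphs_of_type(sections, wanted):
    """Collect the text of every paragraph whose "type" matches `wanted`."""
    return [
        para["content"]
        for section in sections
        for para in section.get("paragraphs", [])
        if para.get("type") == wanted
    ]


# Stand-in for the contents of a compiled file (assumed structure).
sample = """
[
  {"id": "1", "title": "Scene a", "paragraphs": [
    {"type": "body", "content": "Plain paragraph."},
    {"type": "blockquote", "content": "Quoted paragraph."}
  ]}
]
"""

sections = load_compiled(sample)
print(paragraphs_of_type(sections, "blockquote"))
```

A stray trailing comma or a missing bit of glue between sections will surface as a `ValueError` with a line and column, which makes it fairly quick to trace back to the prefix or suffix responsible.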

So with all of that demonstrated, if what I was saying earlier didn’t make sense, hopefully this better illustrates what I mean about the implementation being too wide-open once we get past the bit about satisfying JSON syntax validity. But let me know if I’m missing something.