manipulating compile output

I’ve been trying to use scripting languages (PHP, Python or Perl) to do stuff with my content after compiling in Scrivener and I keep running into obstacles. I’m using either HTML or Epub output because those seem to be the only formats that preserve internal and external hyperlinks.

Ideally from a scripting language I’d like to be able to access all the elements of the manuscript: folders, documents, sub-documents, notes, synopses, metadata, links, etc. The generated HTML is not very nicely structured, so I would have to do a pretty crazy amount of guesswork and parsing to extract what I want that way.

I’m starting to think I need to go straight to the underlying RTF. I’ve never worked programmatically with RTF, so I don’t know if there are decent libraries in PHP, Python, or Perl for manipulating it. There isn’t a user-accessible API available in any of the Scrivener versions, is there?

If you’re asking whether there is a Scrivener API that allows access to all of a project’s internals: no.

If I were going to do this sort of thing given Scrivener as it currently exists, I would use Python and the ElementTree parsing (and generating) library. You should be able to do anything with the HTML or epub output (which, as I understand it, is HTML or XML, yes?) that could possibly be done. That library is quite complete. I’ve used it before in a professional context and it was everything I needed.

http://effbot.org/zone/element-index.htm

The problem is that the HTML generated by Scrivener is not easily parseable. I think it would require only slight modification to make it much more usable. Here’s an example:

[code]<body>
<div style="width: 600px; margin: 0 auto">
<p class="p1"><b><a id="doc4"></a>FolderA</b></p>
<p class="p2">Here’s some text in FolderA.</p>
<hr>
<p class="p4"><b><a id="doc3"></a>DocA1</b></p>
<p class="p2">Here’s some text in DocA1.</p>
<p class="p3"><br></p>
<p class="p2">And some more text in DocA1.</p>
<hr>
<p class="p4"><b><a id="doc6"></a>DocA1a</b></p>
<p class="p2">Here’s some text in DocA1a.</p>
<hr>
<p class="p4"><b><a id="doc5"></a>DocA2</b></p>
<p class="p2">Here’s some text in DocA2.</p>
<p class="p3"><br></p>
<p class="p2">And now a <a href="#doc3">link</a> to DocA1.</p>
</body>
[/code]

It would be much more usable with:

*  Tags surrounding documents, for instance:
[code]<div class="doc" id="doc3">
   Document contents...
</div>[/code]
Right now you can guess where a document begins by looking for the anchor, and guess where it ends by looking for an HR tag followed by the next document's anchor, but that's quite clumsy.
*  Semantic markup, like
[code]<title>FolderA</title>[/code]
or
[code]<p class="title">FolderA</p>[/code]
instead of
[code]<p class="p1"><b><a id="doc4"></a>FolderA</b></p>[/code]

Any chance of getting something like that?

Thanks,
Sigfried
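
The guesswork described in this post can be sketched with Python's standard `html.parser`. This is only a rough illustration, not anything Scrivener provides: the `DocSplitter` class and `split_documents` function are hypothetical names, and the sketch assumes every document title sits in a `<p>` that contains an `<a id="docN">` anchor, as in the sample above. It ignores the HR tags and treats each anchored paragraph as the start of a new document.

```python
from html.parser import HTMLParser

# Sample taken from the compiled HTML quoted in this thread (shortened).
SAMPLE_HTML = """<body>
<p class="p1"><b><a id="doc4"></a>FolderA</b></p>
<p class="p2">Here’s some text in FolderA.</p>
<hr>
<p class="p4"><b><a id="doc3"></a>DocA1</b></p>
<p class="p2">Here’s some text in DocA1.</p>
<p class="p3"><br></p>
<p class="p2">And some more text in DocA1.</p>
<hr>
<p class="p4"><b><a id="doc5"></a>DocA2</b></p>
<p class="p2">And now a <a href="#doc3">link</a> to DocA1.</p>
</body>"""


class DocSplitter(HTMLParser):
    """Group paragraphs under the most recent <a id="docN"> anchor."""

    def __init__(self):
        super().__init__()
        self.docs = {}          # anchor id -> {"title": str, "paragraphs": [str]}
        self.current = None     # id of the document currently being filled
        self.buffer = None      # text pieces of the <p> being read, None outside <p>
        self.pending_id = None  # doc anchor id found inside the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.buffer, self.pending_id = [], None
        elif tag == "a" and self.buffer is not None:
            anchor_id = dict(attrs).get("id") or ""
            if anchor_id.startswith("doc"):
                self.pending_id = anchor_id

    def handle_data(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "p" or self.buffer is None:
            return
        text = "".join(self.buffer).strip()
        if self.pending_id:              # title paragraph: start a new document
            self.current = self.pending_id
            self.docs[self.current] = {"title": text, "paragraphs": []}
        elif self.current and text:      # non-empty body paragraph
            self.docs[self.current]["paragraphs"].append(text)
        self.buffer = None


def split_documents(html):
    """Return a dict of documents keyed by anchor id, in source order."""
    parser = DocSplitter()
    parser.feed(html)
    parser.close()
    return parser.docs


docs = split_documents(SAMPLE_HTML)
for doc_id, doc in docs.items():
    print(doc_id, doc["title"], len(doc["paragraphs"]))
```

It works on the sample, but it depends entirely on the anchor convention holding, which is exactly why explicit document wrappers would be nicer.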

Hmmm… How did you produce that HTML?

And what are you ultimately trying to do? Examine the text of a document and understand where in the Binder hierarchy you are? Something else?

Also, keep in mind that with ElementTree you never parse anything yourself. ElementTree does all that. You just ask it to give you what you want, like ‘class=“p2”’ nodes.
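
For illustration, here is a minimal sketch of that kind of query using the standard-library `xml.etree.ElementTree`. One caveat: ElementTree is an XML parser, so the sketch assumes well-formed XHTML (self-closing `<hr/>` and `<br/>`, as an ePub should contain). The `XHTML` sample is adapted from the compiled output quoted earlier; the self-closing tags are an assumption.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the compiled sample; <br/> and <hr/> are assumed.
XHTML = """<body>
<p class="p4"><b><a id="doc3"></a>DocA1</b></p>
<p class="p2">Here’s some text in DocA1.</p>
<p class="p3"><br/></p>
<p class="p2">And some more text in DocA1.</p>
<hr/>
</body>"""

root = ET.fromstring(XHTML)

# Ask for every <p class="p2"> at any depth and collect its text.
body_paragraphs = ["".join(p.itertext()) for p in root.findall(".//p[@class='p2']")]
print(body_paragraphs)

# Plain HTML with an unclosed <hr> is not XML, so the same parser rejects it.
try:
    ET.fromstring("<body><p>text</p><hr></body>")
    raw_html_ok = True
except ET.ParseError:
    raw_html_ok = False
print(raw_html_ok)
```

Real ePub XHTML usually declares the XHTML namespace, in which case the path would need namespace-qualified tag names.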

I produced the HTML just by compiling to HTML (and extracting the relevant parts).

What I’d ultimately like to do is take published output from Scrivener and put the pieces in places of my own choosing, not supported by Scrivener’s publishing options. I find Scrivener to be a wonderful platform for composing, but not particularly helpful for creating the output I want.

I know that elementTree and other parsers do the actual HTML parsing and put stuff into a DOM (Document Object Model) that I can then manipulate. The problem with Scrivener’s HTML is that the objects I can extract through something like elementTree are not the objects I care about – namely, documents, titles, meta-data, etc. The useful objects that Scrivener manages have been mangled and merged with formatting into HTML objects that don’t help me much at all.

Have you looked at the Export, rather than Compile options? Export will give you individual files, while Compile assumes that you want to mush everything together.

Katherine

Yeah. In my experiments with it, export doesn’t capture hyperlinks, which makes it unusable for me. Am I missing something?

I don’t know if it’s worth it, but I’m back to thinking my only hope is going straight to the Scrivener project files. The .scrivx XML file looks pretty usable. The only part that (at first glance) doesn’t look fairly self-explanatory is links (and probably comments and other highlighting). Continuing the example from the code samples above, here are the relevant bits of the RTF file and the .links and .scrivx files showing a link between two documents:

RTF file where the link appears. No evidence here of the link. Filename: “5.rtf”

[code]{\rtf1\ansi\ansicpg1252\cocoartf1138\cocoasubrtf230
{\fonttbl\f0\fnil\fcharset0 Cochin;}
{\colortbl;\red255\green255\blue255;}
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\fi360\sl288\slmult1\pardirnatural

\f0\fs28 \cf0 Here\'92s some text in DocA2.

And now a link to DocA1.}[/code]

The link is described here, but in a way I don’t understand. Filename: “5.links”

[code]<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Links Version="1.0">
    <ScrivenerLinks>
        <TextLinks>
            <LinkID Range="38,4">3</LinkID>
        </TextLinks>
    </ScrivenerLinks>
</Links>[/code]

Extract from “test.scrivx”. BinderItem ID 6 is the document linked to, ID 5 is the document linked from.

[code]<Children>
  .....
        <References>
          <Reference BinderID="5" Destination="[Internal Link]">DocA2</Reference>
        </References>
        <Children>
          <BinderItem ID="6" Type="Text" ...>
            <Title>DocA1a</Title>
            <MetaData>
              <IncludeInCompile>Yes</IncludeInCompile>
              <NotesTextSelection>0,0</NotesTextSelection>
            </MetaData>
            <TextSettings>
              <TextSelection>27,0</TextSelection>
            </TextSettings>
          </BinderItem>
        </Children>
      </BinderItem>
      <BinderItem ID="5" Type="Text" ...>
        <Title>DocA2</Title>
        <MetaData>
          <IncludeInCompile>Yes</IncludeInCompile>
          <NotesTextSelection>0,0</NotesTextSelection>
        </MetaData>
        <TextSettings>
          <TextSelection>52,0</TextSelection>
        </TextSettings>
      </BinderItem>
    </Children>
  </BinderItem>
</Children>[/code]
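
As a rough sketch of how usable the .scrivx structure is, the binder hierarchy can be walked with the standard-library `xml.etree.ElementTree`. The `BinderItem`, `Title`, `Children`, `ID` and `Type` names come from the extract above. Everything else is an assumption made for illustration: the `<Binder>` wrapper, the `Folder` type, and the mapping of IDs 4 and 3 to FolderA and DocA1 (inferred from the anchors in the compiled HTML). The `walk` function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified .scrivx fragment (wrapper element and "Folder" type assumed).
SCRIVX = """<Binder>
  <BinderItem ID="4" Type="Folder">
    <Title>FolderA</Title>
    <Children>
      <BinderItem ID="3" Type="Text">
        <Title>DocA1</Title>
        <Children>
          <BinderItem ID="6" Type="Text"><Title>DocA1a</Title></BinderItem>
        </Children>
      </BinderItem>
      <BinderItem ID="5" Type="Text"><Title>DocA2</Title></BinderItem>
    </Children>
  </BinderItem>
</Binder>"""


def walk(container, depth=0):
    """Yield (depth, ID, title) for each BinderItem under container, depth-first."""
    for item in container.findall("BinderItem"):
        yield depth, item.get("ID"), item.findtext("Title", default="")
        children = item.find("Children")
        if children is not None:
            yield from walk(children, depth + 1)


outline = list(walk(ET.fromstring(SCRIVX)))
for depth, item_id, title in outline:
    # The text of item N appears to live in "N.rtf" (e.g. "5.rtf" for DocA2).
    print("  " * depth + f"{title} ({item_id}.rtf)")
```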

Can anyone explain how these links work? How do I figure out which part of the text the link is coming from?

Thanks

I spent a little time trying to figure out how links work, but to no avail. Hopefully someone can provide a clear explanation.

I do really like the Scrivener interface for composing books, but the publishing options are too limiting. I’m close to deciding I need to start an open source project to make something that has some of Scrivener’s composition features, but a lot more flexibility in the output: printing, PDF, HTML, ePub, with a small, sensible markup-type language for placing and formatting objects like titles, page numbers, chapter/section numbers, notes, synopses, meta-data, etc.

Because Scrivener for Mac is the product of a single developer (me), we don’t have the resources to write our own HTML parsers and suchlike. The HTML generated by Scrivener is Apple’s built-in OS X HTML exporter.

As for the .links file format, it is fairly straightforward. <LinkID Range="38,4">3</LinkID> means that there is a link starting at character 38, spanning four characters, and that the link points to the document with the internal ID “3”.
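
Read that way, the .links file can be turned into (start, length, target) tuples with a few lines of Python. This is a sketch based only on the single sample posted in this thread; `read_links` is a hypothetical helper, and other element types in real .links files are not handled.

```python
import xml.etree.ElementTree as ET

# Contents of "5.links" as posted earlier in the thread.
LINKS_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Links Version="1.0">
<ScrivenerLinks>
<TextLinks>
<LinkID Range="38,4">3</LinkID>
</TextLinks>
</ScrivenerLinks>
</Links>"""


def read_links(xml_bytes):
    """Return a list of (start, length, target_id) for every LinkID element."""
    links = []
    for el in ET.fromstring(xml_bytes).iter("LinkID"):
        # Range="start,length"; the element text is the target document's ID.
        start, length = (int(n) for n in el.get("Range").split(","))
        links.append((start, length, el.text.strip()))
    return links


links = read_links(LINKS_XML)
print(links)
```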

Scrivener is really intended as a first draft tool, so there really aren’t too many “publishing” options - for that you would want to move to a page layout or publishing tool for the final stages. It is more than capable of producing epubs and PDF files for self-publishing novels and suchlike, but is not intended for more complex layouts. On the other hand, if you are looking for a markup-type language, you may wish to look into MultiMarkdown, which allows you complete control of the output.

All the best,
Keith

Thanks. Character 38 of what? The underlying RTF? The printable-character-only content of the RTF? Something else? It looks to me like it’s the 38th printable character of the RTF. But I don’t know RTF well enough to understand how you count printable characters.

Also, the link appears in both the .links file and in the .scrivx file, but with two different offsets. What’s that about?

What I want to do is self-publishing, with nothing any more complex than an ePub. If I could get some kind of decently-formatted access to the whole content, it would be easy. The XML format used in the .scrivx file would be perfect if I could also easily get all the individual documents with links.

One thing I have not yet tried is compiling to HTML with file separation between all documents and combining what I get from there with information from the .scrivx file. I’ll give that a whirl as soon as I get a chance, but it would be a big help if you could give a more detailed answer to the link question.

Thanks,
Sigfried

Sorry, yes, of the rich text itself - that is, of the text after it has been converted from RTF syntax to text as it appears in the editor.
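
A quick check against the DocA2 sample seems consistent with this. The sketch below assumes the editor text has a blank line between the two paragraphs (two newline characters, inferred from the compiled HTML) and that offsets are zero-based. Under those assumptions, `Range="38,4"` lands exactly on the word "link". The `linked_html` helper is hypothetical and simply re-creates the kind of anchor seen in the compiled HTML.

```python
from html import escape

# Plain text of DocA2 as it would appear in the editor (blank line assumed).
TEXT = "Here’s some text in DocA2.\n\nAnd now a link to DocA1."
LINKS = [(38, 4, "3")]  # (start, length, target ID) from Range="38,4" and LinkID 3


def linked_html(text, links):
    """Wrap each linked range in an <a> tag and escape the surrounding text."""
    out, pos = [], 0
    for start, length, target in sorted(links):
        out.append(escape(text[pos:start]))
        out.append(f'<a href="#doc{target}">{escape(text[start:start + length])}</a>')
        pos = start + length
    out.append(escape(text[pos:]))
    return "".join(out)


anchor_text = TEXT[38:38 + 4]
html_text = linked_html(TEXT, LINKS)
print(repr(anchor_text))
print(html_text)
```

If the ranges turn out to count UTF-16 code units (as Cocoa string ranges do), text containing characters outside the Basic Multilingual Plane would need the offsets adjusted before slicing a Python string.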

The links for individual documents don’t appear in the .scrivx file - that may be for the project notes.

Why can’t you use the .epub export as it stands? That supports links already.

Unfortunately you would need to parse the RTF for that.

All the best,
Keith