RTF control of HTML output

agavin · August 24, 2012, 4:50pm

I’m a novelist (and also a computer programmer) and I’m attempting to finagle Scrivener into serving as a live component in my ebook generation pipeline. As the current compile formatting is only about 90% there, I’m trying to combine HTML output with my own CSS stylesheet and then feed it into Calibre.

Are there any ways to do the following, or can we get some? These features would help not only ebook generation but certainly any HTML-based generation. I don’t, myself, mind using CSS to add the formatting, but the markup needs to be properly structured to accept the CSS.

Is there any way to better control the relationship between the structured auto-generated text (titles etc.) and the tagged output? An easy solution would be to output various structural components (like title, nesting level, etc.) in the form of multiple class names. For example:

xxxxxx

or <p class="p11 chapter-separator">xxxx</p>

These extra classes would make it easy to attach appropriate CSS. It would also be useful to be able to specify the tag type of these auto generated structural elements (for example making titles

or whatever. Sometimes the header tags are used by other programs to detect chapter headings.

Or perhaps an easy solution here is just to have a checkbox for headers in the compile that allows for “raw html insert” and it is then the burden of the user to build up the appropriate HTML in the compile window. A raw “section layout” that is complete with HTML and Scrivener <$> tags would enable any template to be built. Sometimes desired output is something like:

Chapter 2: Dark Shadows

Two:

Dark Shadows

Near Salem, Massachusetts, Saturday evening, October 18, 1913

The

is hidden in CSS, and for Calibre’s benefit. The other lines are styled by the CSS.

Same goes for separators. But again, a raw HTML option would serve fine.
Also, as I mentioned in a previous topic, the ability to put <$img> tags in separators would be great.
Is there any way to specify in normal RTF documents (body text) formatting that can then translate reliably into particular and

tags with controllable class? I have used color to do this for spans but it is ugly. A range of color will become some arbitrary “s1” type class that I can modify the CSS of. It would be much nicer to be able to select a span and mark it as a named class and have it appear in the editor as a more subtle indication of formatting. Same goes for specifying the class of a paragraph (div or p). Example of this kind of thing would be to mark “first paragraph of chapter”, “paragraph to put a box around” or “phrase to shift into smallcaps”. The actual formatting would be done by CSS.
As a total nicety a way to specify CSS (in a window, thru a file, or document) to insert in the style block would be awesome.
A powerful tool might also be a series of “replacements” that run on the outputted HTML. Effectively I have to do a bunch of search and replace currently manually. I might do this myself by writing a bash script with some kind of regex find/replace. Yuck. But there is a problem, the specific class tagging of different elements “moves around”. I.e. it could be class=“p8” or class=“p9” as these are incremented based on the contents. Hence ways to make these more structural and hence less content sensitive would be better (i.e. class=“chapter-separator” instead of class=“p8”).

This all sounds complicated, but overall making the HTML output more CSS friendly really amounts to only a few small changes. I have looked at MMD a little bit and it doesn’t really seem to do what one would want for ebook generation. Plus, the idea of gumming up the text of my book with formatting descriptors is pretty appalling. I like that in Scrivener my chapters are only lightly formatted and that the WORDS are all my words. Some light RTF to indicate class for span/p would be much better than some ugly MMD tag. And I certainly wouldn’t want to lose natural italics for italics and the like. That would remind me too much of AppleWriter circa 1982.

The current workflow, which is to spend 2 hours manually massaging the HTML after output is likewise pretty hideous. The whole purpose of the structured approach is to automate this process.

kewms · September 9, 2012, 11:17pm

It’s already possible to include raw HTML markup in a document, and to output raw HTML from Scrivener, by compiling to either HTML or plain text. That might be a more productive direction to pursue than grafting CSS commands onto RTF output.

I can’t speak for the development team, but I would guess that any feature which requires Scrivener to actually parse HTML (or any other) markup would probably be seen as a lot of work for a relatively small benefit, and somewhat out of scope for Scrivener’s intended purpose.

Katherine

agavin · September 9, 2012, 11:53pm

It doesn’t sound like you really got the point of what I was talking about. Pardon me if it’s obtuse, as I’m a programmer myself with a very diverse set of skills (from video games to compilers to databases to web apps to Mars robots and more).

What I’m talking about would be REALLY easy. It’s basically just about getting a little control and stability on what is already there. Right now, the RTF->HTML output is good, but it isn’t properly tagged for easy control via CSS. The existing class attributes are generated by what appears to be an in-order discovery hash – which is natural – but that means that if you change the contents of the work in even insignificant ways, they bump around. Simple stabilizations like adding STRUCTURE (and Scrivener is all about structure) based class names would make things much easier. For example, what I talk about above in #1.

I get the job done fairly easily by having a script (I whipped one up in Ruby) which does half a dozen find and replaces in the document after the HTML is generated. But this is an awkward hack and I have to fix it every time anything changes in the doc. If the class attributes were output in a way which was stable, I wouldn’t have to, I could just slap in the CSS and go. Plus I have to do some really awkward stuff to mark spans.
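A find-and-replace pass of this kind can be made less fragile by keying on what a generated rule says rather than on its number. The following is a minimal sketch in Python (not the Ruby script mentioned above). It assumes the exported HTML carries a style block with rules shaped like `p.p8 {...}`, which is an assumption about the converter's output. The `STABLE_NAMES` table and the `stabilize_classes` function are hypothetical names invented for illustration.

```python
import re

# Hypothetical table: a distinctive fragment of a generated rule's
# declarations -> the stable class name we want attached to that rule.
STABLE_NAMES = {
    "text-align: center": "chapter-separator",
    "font: 24.0px": "chapter-title",
}

# Assumed rule shape in the exported style block: p.p8 {...} or span.s1 {...}
RULE = re.compile(r"\b(?:p|span)\.(?P<cls>[ps]\d+)\s*\{(?P<body>[^}]*)\}")
CLASS_ATTR = re.compile(r'class="([^"]*)"')


def stabilize_classes(html: str) -> str:
    """Append a stable class next to each matching generated class."""
    renames = {}
    for rule in RULE.finditer(html):
        for fragment, stable in STABLE_NAMES.items():
            if fragment in rule.group("body"):
                renames[rule.group("cls")] = stable
                break

    def add_stable(match):
        classes = match.group(1).split()
        extra = [renames[c] for c in classes
                 if c in renames and renames[c] not in classes]
        return 'class="%s"' % " ".join(classes + extra)

    return CLASS_ATTR.sub(add_stable, html)
```

Because the generated class is kept and the stable one is only appended, the original style rules keep working, while hand-written CSS can target `.chapter-separator` whether the converter numbered it `p8` or `p9` on a given run.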

If Keith or someone wants to talk to me to get the full scoop I can probably work out how to implement the stuff in extremely little coder time. I’ve talked him through other cool features before.

Andy

KB · September 10, 2012, 9:29am

<$img…> tags will work in separators in the next update.

It’s not possible to specify meaningful titles for the <p…> styles because Scrivener uses the standard OS X HTML converters which don’t allow for this - I’d have to write my own HTML parser from scratch to do this, and I’m afraid I have no intentions of going that far!

You can assign <h…> levels, though, using the “Formatting” pane in Compile - there is a pop-up button there that allows you to set titles to be various heading levels, for instance.

Unfortunately, while it might look like making the HTML output more CSS friendly only amounts to “a few small changes”, those small changes would involve using a completely different HTML parser. I already have to do an enormous amount of post-processing on the Apple-generated HTML to get it to do half of what is necessary to generate decent HTML, especially for e-books, and although it’s far from perfect, this is still better than trying to write and maintain a custom HTML converter, so I’m afraid that for the foreseeable future, the only way of tweaking this stuff is manually, as you are doing now. For most users, this isn’t an issue, as the e-book files that Scrivener generates are more than good enough for most purposes, but it’s not really suited for custom control.

All the best,
Keith

agavin · September 10, 2012, 2:21pm

Ah, I didn’t realize it was generated by an existing library. That makes sense, although it is a minor bummer. I once wrote HTML parsers in both directions for Flektor.com, for slightly similar reasons. The outputter didn’t take very long actually, and had few problems (unlike the importer), but I wouldn’t recommend adding that to your todo pile just on the basis of this need.

I’ll look at the popup!
Andy

KB · September 10, 2012, 3:53pm

Basic parsing probably wouldn’t be so bad - it’s when you get to parsing NSTextBlock and NSTableBlock, the Apple text system’s classes for handling text lists and tables, that things get really hideous.

All the best,
Keith

P.S. One thing that e-book export doesn’t have that HTML export does have is a way of including raw HTML. For HTML export, you can use “Preserve Formatting” and tell Compile, in the HTML options pane, to treat these blocks as raw HTML (these blocks get inserted into the raw HTML in post-processing code). This makes less sense for e-books - for the time being - because users are more likely to want to mix things up, using Preserve Formatting for, er, preserving the formatting of some text blocks, while maybe using it for marking out raw HTML elsewhere. This is only problematic because the behaviour of “Preserve Formatting” has been bent to different things for different formats. On the list for 3.0 is to mix this up with a styles system, so that you would be able to mark up different blocks of text in different ways and have them treated differently during Compile. This is all still in the drawing-board stages at the moment, though - I have a long list of things I want to do for 3.0 but have yet to start on the actual code.

agavin · September 10, 2012, 4:34pm

While you are doing that, look to see if there is some easy way to at least mark spans and paragraphs with some kind of style that can optionally be invisible but result in at least different (even if not stable) class tagging in the HTML.

I use the same project and different compile templates to output my novel for several different purposes: manuscript, PDF for CreateSpace printing, e-book for quicky ebooks, and HTML for export to Calibre to make “real” final ebooks. So the “preserve HTML” doesn’t totally work for me because it would gunk up my non-html export. You could possibly look at being able to mark HTML in the document such that it’s inserted in ebook or HTML output but just snipped out of the others.

I.e. “text text text this is my funnytext in html only text text text”

Where in the non HTML formats you’d just snip out the and but leave the guts of the text. Somewhat hacky.

At present, I find that the mobi/epub output is great for just quickly making a doc, and I have it looking fairly OK, but there are certain nuances to the styling that I can’t get that way that I must have in a “production” ebook. These include the <$img> in separators (but you are adding that – please, if you can, make sure that multiple refs to the same image do not result in including the image more than once) and, in similar ways, fine control over the formatting of images. In my real ebook I throw a class=“illustration” into the resultant tags generated by my <$img> tags. I then use the CSS to properly style/center etc. these in the ebook. I haven’t found a way to get that perfectly in Scrivener itself.
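The class=“illustration” step lends itself to a small post-processing pass. This is a rough Python sketch, assuming the compiled HTML contains ordinary `<img ...>` tags. The function name `tag_illustrations` is made up for this example, and a regular expression is only adequate for simple, predictable markup like this.

```python
import re

# Match <img ...> tags that do not already carry a class attribute.
IMG_WITHOUT_CLASS = re.compile(r"<img\b(?![^>]*\bclass=)([^>]*)>")


def tag_illustrations(html: str, css_class: str = "illustration") -> str:
    """Insert class="<css_class>" into every <img> tag that has no class."""
    return IMG_WITHOUT_CLASS.sub(
        lambda m: '<img class="%s"%s>' % (css_class, m.group(1)),
        html,
    )
```

Tags that already have a class are left untouched, so the pass can be re-run on its own output without stacking attributes.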

The automatic ebook creation is tantalizingly close, just not quite 100%. But I would imagine that a decent % (and growing) of your users are self-publishing authors.

Thanks!
Andy

Jaysen · September 10, 2012, 4:53pm

Just kind of following along…

I’m wondering whether a pseudo or personal markup method might not be the right answer for you. Something like
<<>> blah blah blah<<>>

As you provided your programming experience, I think you can see the plus and minus to this method. I’m not saying it’s the best, but it might just get what you need with minimal, or at least limited, effort.

If you extend it a bit further you might consider
blah blah sp.o(CL=funnytext) blah blah blah sp.c() blah blah

Again, there are + and -

I’m sure you are already listing them…

agavin · September 10, 2012, 5:03pm

Certainly something like that could work if there was a checkbox in the compile template to delete it/ignore it or pass it on so different templates could do different things with it.

Jaysen · September 10, 2012, 5:37pm

That’s the big con: you would need to “post process” the output. I’d use perl, but sed/awk would be just as easy. You mentioned Ruby earlier…
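For what it is worth, the post-processing for the `sp.o(CL=...)` / `sp.c()` markers suggested earlier could look roughly like this. It is a sketch in Python rather than perl or sed, and `expand_markers` is a name invented here. The `html` flag stands in for the per-template "pass it on or delete it" choice.

```python
import re

# Opening marker: sp.o(CL=classname), plus one optional trailing space.
OPEN = re.compile(r"sp\.o\(CL=([\w-]+)\)\s?")
# Closing marker: sp.c(), plus one optional leading space.
CLOSE = re.compile(r"\s?sp\.c\(\)")


def expand_markers(text: str, html: bool = True) -> str:
    """Turn the pseudo-markup into <span> tags, or strip it for non-HTML output."""
    if html:
        text = OPEN.sub(r'<span class="\1">', text)
        return CLOSE.sub("</span>", text)
    # Non-HTML targets: drop the markers but keep the text between them.
    return CLOSE.sub("", OPEN.sub("", text))
```

Run over the compiled HTML with `html=True` and over the manuscript or PDF source text with `html=False`, so the same project feeds every format.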

Anyway, the idea behind scriv, as I have internalized it, is less about “perfect output formatting” and much more about “flexibility rivaling a contortionist in your drafting methodology”. Which means that you get an amazing ability to chunk the data into relocatable blocks, but then you get to do a bit more work in making it pretty.

Granted, any prettying I do to my writing is just another layer of lipstick…

agavin · September 10, 2012, 6:02pm

I just have a thing where I don’t believe in REPEATED prettying of manuscripts. Anything that involves outputting the file and then making it look good, in say Word, is unacceptable to me. I just won’t do it. It’s not that I object to the other program or the time invested, but to the manual nature of that, in that one would have to do it every time one changed the draft.

I look at the Scrivener project as the source code. I believe in having an automated (or nearly so) pipe to compile it into various forms. I generate and send off drafts (formatted or not) all the time and I never really see the project as 100% “finished.”

Jaysen · September 10, 2012, 8:13pm

That’s where there may be a divergence from the “scrivener way” as it was once explained to me (either by KB, Ioa (AmberV) or one of the others). Basically the use of scrivener is more like CAD, where scrivener is the CAD design and the “compiled manuscript” is the result of sending your CAD file to the CAM device of choice. Once you cut the material (make it pretty) you can’t really go back to the CAD unless you are going to cut another piece of material.

The analogy is flawed, but it’s there somewhere. Basically once you export from scriv, the scriv project would only really be SOR for major changes. It took me a while to abandon my desire to use scriv like an IDE but once I did things got much easier for me to manage.

Keep us updated on how you manage.