How An “Index-Creating” Algorithm Might Work

subgeniuszero · March 12, 2017, 1:45pm

I was thinking the other day how Scrivener’s developers might attack the problem of including an “index generation” algorithm (and workflow) into Scrivener, such that it could generate an index as part of “back matter” for a book in the same manner that it can currently generate a TOC for a book using “Copy as TOC” in the Edit Menu. Here’s what I came up with:

The user sets out with the intention of creating a book that has an index. They have to start out with this intention, as it will affect the way they approach creating their Project.
They set up their Project Keyword list with the intention of creating a list of Index keywords. Each keyword will be an entry into the eventual index, with child-keywords being sub-entries, etc. Each Keyword must match exactly the text to be indexed within the finished Draft.
Once that is done, the user selects “Create Draft Index Document” from, say, the “Project” or “Documents” menu and places it in, say, a “Back Matter” folder in the Binder. The program creates the file and auto-populates it with all the Project Keywords with placeholder tags to indicate where the page-numbers will be filled in. The user applies formatting to the Keyword list and page-number tags as they see fit.
In the Compile dialogue, assuming there is an added section to select Back Matter as well as Front Matter, the user selects their appropriate Back Matter folder, and selects “Use This File To Create An Index” next to the file they’ve placed in the Back Matter folder.
The User clicks Compile.
The program then compiles the Draft, but with a twist: as it compiles, it scans the Draft for the Keywords in the Index file. It records the page number of every occurrence of each Keyword, then writes these page numbers into the index file, thus generating the index. It then throws away the compiled Draft and compiles a second time, this time including the newly generated index file in its proper place in the Draft.
The user gets an output file with a nicely generated Index.
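
The two-pass idea in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything Scrivener does: the function names (`build_index`, `render_index`, `compile_with_index`) are made up, the first compile pass is assumed to hand back a plain list of page texts, and matching is whole-word and case-sensitive, mirroring the "must match exactly" rule.

```python
import re
from collections import defaultdict


def build_index(pages, keywords):
    """Pass 1: map each keyword to the sorted page numbers it occurs on.

    pages    -- page texts from a throwaway compile; pages[0] is page 1
    keywords -- exact phrases to look for (stand-in for the Project Keywords)
    """
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for keyword in keywords:
            # Whole-word, case-sensitive match, so "bell" does not hit "bells".
            pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
            if re.search(pattern, text):
                index[keyword].add(page_number)
    return {keyword: sorted(numbers) for keyword, numbers in index.items()}


def render_index(index):
    """Format the index as one 'keyword, page, page' line per entry."""
    lines = []
    for keyword in sorted(index, key=str.lower):
        numbers = ", ".join(str(n) for n in index[keyword])
        lines.append(f"{keyword}, {numbers}")
    return "\n".join(lines)


def compile_with_index(pages, keywords):
    """Pass 2: return the draft again with the index appended as back matter."""
    index = build_index(pages, keywords)
    return pages + [render_index(index)]
```

Because the index is appended as back matter, the page numbers found in the first pass stay valid in the second. If the generated file were placed ahead of the body text, the pagination would shift and the scan would have to be repeated.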

Thoughts on this?

—Andy H.

derick · March 12, 2017, 2:44pm

This is pretty much how WordPerfect has done it since the late 80s. Unfortunately this and every other automatically-generated keyword index only really generates a starting point for a process that has to be humanly curated to be very good. Professionals get $6-8 / page for indexing.

For a sense of what’s necessary to do it right see press.uchicago.edu/Misc/Chic … mplete.pdf

subgeniuszero · March 12, 2017, 3:32pm

Well, obviously, I just mean as a starting point. By no means do I mean that Scrivener’s output could be used as a final, be-all, end-all index for a document. But it would be a nice way of getting started, without having to resort to MS Word’s method of doing an index, which I personally loathe. It could even have a special “Index Mode” for the Keyword list, with a column for “See” references, just to make things interesting, and “child” keywords could be used for subheadings.

gr · March 13, 2017, 5:39am

I have sometimes dreamed of things that could be done with more kinds of specialized inline annotation. Facilitating manual indexing is one of them. If Scrivener’s inline annotation and footnoting functionality was iterated to make a third sort of inlining possible, one could have inline indexing. Any point in a text which an index entry should reference could be given an inline index annotation whose inline text simply was the text of the desired index entry (and optional subindex entry). While perhaps not crazy to implement, still it is primitive, since, without an accessible overview of extant index entries and subentries over the whole volume, it would be hard to maintain consistency and balance across a large work.* So, introducing a function like the above would no doubt not be the end of the story.

gr

Indexing for print publications is, of course, not something that could sensibly be done in Scrivener – because it must be keyed to the publisher’s pagination. So, the sort of indexing we are talking about may not require the most exacting standards. But still.

derick · March 13, 2017, 7:44pm

In a fit of procrastination, I actually went back and looked at how this was done in WP 5.1 (see http://www.wpuniverse.com/vb/showthread.php?37516-WPDOS-5.1-Manual-(scanned-and-searchable) for a link to the manual).

WP actually gives you two options - marking text inline (what gr describes) and writing a concordance file (what subgeniuszero describes, i.e. a keyword list). Once either or both of these are in place, you designate where in the document the index will go, then generate (i.e. compile) it.

I used this to index B.A., M.A. and Ph.D. theses of 100-400 pages each, and it worked great. The model of allowing both marked text and a keyword list seems especially helpful.
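
As a rough illustration of how the two WP sources (inline marks and a concordance list) can feed one index, here is a minimal Python sketch. The `[[ix:...]]` mark syntax and the function name `collect_entries` are invented for the example; WordPerfect's actual codes are not reproduced here.

```python
import re
from collections import defaultdict

# Hypothetical inline mark syntax: [[ix:entry text]]
MARK = re.compile(r"\[\[ix:(.+?)\]\]")


def collect_entries(pages, concordance):
    """Merge inline marks and concordance hits into one entry -> pages map."""
    index = defaultdict(set)
    for number, text in enumerate(pages, start=1):
        # Source 1: explicit marks. The entry need not appear in the prose.
        for entry in MARK.findall(text):
            index[entry.strip()].add(number)
        # Source 2: concordance phrases, matched against the prose only.
        prose = MARK.sub("", text)
        for phrase in concordance:
            if re.search(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", prose):
                index[phrase].add(number)
    return {entry: sorted(numbers) for entry, numbers in index.items()}
```

The marks cover ideas that no single word captures, while the concordance catches every literal occurrence of a term without any manual marking.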

subgeniuszero · October 19, 2018, 4:12pm

So how’s about it, KB? Are you listening out there in the Phantom Zone? Could we ever see such a feature in Scrivener? This thread has been dead a long time, just thought I would try to breathe some life back into it today while I was surfing the forums. It’s a neat idea, isn’t it? Would there be any way to do this with Scrivener’s current feature set?

AmberV · October 19, 2018, 6:01pm

Scrivener already does inline indexing!

Compile that as configured, open in LibreOffice, and generate an index at the bottom of the file. You’ll see the two words marked green in the editor have been properly indexed.

One could probably also use a similar technique for .docx, and LaTeX already has the wiring set up out of the box.

(The examples have been refined and are now included in the base installation of the software. Refer to the post below for instructions on how to use them.)

gr · October 20, 2018, 2:00pm

I can see what code Word is using in the xml of the docx for index entries…

<w:instrText xml:space="preserve"> XE "tintinnabulation" </w:instrText>

…but I don’t see how for .docx compile I can specify raw xml prefix/postfix insertion.

Should I expect something like this would work if I compiled using pandoc–>docx? My attempt at that just gives an error, hence the question. (Not really a pandoc user, obvs.)

subgeniuszero · October 20, 2018, 2:48pm

For the layman, could you please explain what this is doing and how it works, in detail, step by step?

xiamenese · October 20, 2018, 3:10pm

My issue with all this is that I wouldn’t want Scrivener to do everything for me. Whatever I compile, I’m going to run through Nisus Writer Pro—yes, I’m a Mac user! —anyway. In fact, with Scrivener 3, I have compile set to open the file automatically in NWP. I do that because there are many things I want to check following compile because I spot things that I’ve missed because of the different interface. And NWP has a long-established indexing system already in place—and I can invoke Bookends scanning and bibliography building with a click on a menu entry. I could probably use Word just as well, except that I don’t feel at home in it and I’m not sure about the Bookends part.

I suppose it’s just that I’m old enough not to want everything immediately, and sitting back while a macro runs to mark all the Chinese in my text as being in Chinese and setting the font I want accordingly and letting Bookends do its scanning and building the bibliography at the end of the document is a bonus. And then if I wanted an index, using NWP to help me do the indexing is merely another process done at that stage.

I guess it’s a matter of age and expectations.

Mark

AmberV · October 20, 2018, 4:09pm

Yeah, you can insert raw XML into the .docx file with Pandoc—very similarly to how I inserted raw ODT XML in the example project. For reference, here is what the ODT style settings look like:

Prefix:

`<text:alphabetical-index-mark-start text:id="<$n:index>"/>`{=odt}

Suffix:

`<text:alphabetical-index-mark-end text:id="<$n#index>"/>`{=odt}

It’s the bracketed syntax code on the end that tells MMD or Pandoc to only insert this text verbatim into f/odt files (and omit it entirely otherwise). We can use the “openxml” code to insert XML for Word files. The code you provided is enough to define the term, but you also need to wrap special fields like this in a fldChar element.

The only problem is that unlike ODT, marking a word as indexed like this removes it from the output. So I need the styled phrase to be both inside the syntax and outside so that readers can see it. To do that we use a much simpler Style prefix/suffix of “%%” as a placeholder, and then handle the actual syntax expansion with Replacements:

Replace:

%%$@%%

With (the $@ symbol is inserted twice, once within the field and once outside of it):

`<w:r><w:fldChar w:fldCharType="begin"/></w:r><w:r><w:instrText xml:space="preserve"> XE "$@" </w:instrText></w:r><w:r><w:fldChar w:fldCharType="end"/></w:r>`{=openxml}$@

As far as I can tell from the documentation on XE field type, as well as field codes in general, that is the desired result. It looks a bit funny in LibreOffice though, in that the field occupies a space directly preceding the word. When using LibreOffice itself to index words, the word is a visible part of the field. So I’m not sure how Word displays index markings normally. I’d be curious to see if the compiled result from the updated demo looks “normal”.

I tried using the syntax to print a field value, but that was ignored.

But if it looks okay in Word and works as expected, I can get these modifications added to the stock MMD/Pandoc compile formats.
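
For anyone who wants to see what the "%%" Replacement above does to the text, here is a small Python sketch that performs the same expansion outside Scrivener. The function name `expand_index_terms` is made up, and the XML escaping of the term inside the field is an addition that is not part of the Replacement as posted.

```python
import re
from xml.sax.saxutils import escape

# Raw OpenXML for a Word XE (index entry) field, wrapped in the raw-inline
# syntax so that only .docx output receives it. {key} is filled in per term;
# the doubled braces become a literal {=openxml} after str.format().
FIELD = (
    '`<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    '<w:r><w:instrText xml:space="preserve"> XE "{key}" </w:instrText></w:r>'
    '<w:r><w:fldChar w:fldCharType="end"/></w:r>`{{=openxml}}'
)


def expand_index_terms(markdown):
    """Replace each %%term%% with an XE field followed by the visible term."""

    def repl(match):
        term = match.group(1)
        # The term appears twice: once inside the field (escaped for XML),
        # and once after it so the reader still sees the word.
        return FIELD.format(key=escape(term)) + term

    return re.sub(r"%%(.+?)%%", repl, markdown)
```

A term containing a double quote would still break the quoted XE string, so this sketch assumes plain words and phrases.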

(The examples have been refined and are now included in the base installation of the software. Refer to the post below for instructions on how to use them.)

AmberV · October 20, 2018, 4:24pm

Basically:

Open the attached sample project on my previous message.
Open File ▸ Compile… and select from either MultiMarkdown->ODT or Pandoc->DOCX (you’ll need Pandoc installed to see the latter).
Compile.

To index the terms in the editor, you need only mark the words with the supplied “Index Term” style. As for the fact that MMD/Pandoc are used for document production, it is of minimal impact to how one writes, as I have the checkbox enabled that converts rich text formatting to Markdown syntax.

It’s the result that will be a little different. This way of working is not about formatting at all. While Pandoc can take a Word template file to set up document formatting, it’s not like normal Scrivener where you can paint font colours everywhere and do whatever you want and have it in the .docx file.

But the result is fully wired up with styles, so formatting isn’t difficult—and surely less difficult than manually indexing every time you compile. So that’s the balance.

I recall you’re one to spend hours on the pixel alignment & colours in table settings though, so this whole semantic-first approach may not be your cup of tea. I figured it’s worth noting however that, strictly speaking, Scrivener can do this—and here is how.

And if it works for you, save the desired compile Format and implement the Index Term style in your project.

I guess my point of view on the matter is that if you’re going to be automating some of this to some degree somewhere, whether it be in a macro language in a word processor or baked into your Scrivener compile settings, it’s not much of a difference in terms of procedure. The main factor is familiarity. I’d have to bootstrap myself from scratch to write an NWP macro right now, but I know how to squeeze XML into the output using Scrivener like I know how to tie my shoes.

gr · October 20, 2018, 8:39pm

Pretty sure this literally implies you tie your shoes with XML. While somehow that doesn’t surprise me, you will understand that we now need to run the captcha test on you (again):

[_] I am not a robot.

–gr

p.s. Thanks for .docx insight. I will have a look at the test doc. Strategically, I would note that enforcing the identity of text phrase and index entry prevents indexing anything other than specific terms that occur on pages and also disallows subindexing. So, I am not sure the extra work there is for the best. Or even if I did work an Index Item style like this, I figure I would still need another style set up in the less clever way (index item not included in resultant body text), so as to get that latitude in indexing (indexing is often more about locating ideas which may or may not correlate with and be well-indexed by the occurrence of certain words).

gr · October 20, 2018, 11:36pm

Okay, a reinstall of my pandoc took the kink out, so I was able to try this out.

The docx indexing works like a charm. Thanks for thinking further about this!

A) Perhaps with an eye toward unanticipated ramifications (and if there was not some reason you switched the order), you might want to put the $@ in the front in the Replacement rather than after, since that will make the resulting XML match the ordering of things Word uses for index code. When I tried it the other way it seemed to work just as intended (also). I don’t know, one could imagine that, if MS is thinking the XE is tagging what comes before the tag, it might make a difference in behavior in edge cases – e.g., what page number gets calculated for the index when pages break. Maybe. Possibly.

Thanks again for thinking more about this.*

best,
gr

Though I know you probably could not help yourself, since ‘latex’ and ‘pandoc’ and ‘algorithm’ are all trigger words for you.

AmberV · October 22, 2018, 5:38pm

Don’t get me started on that “I’m not a robot” checkbox! If you aren’t hooked into a lot of Google cookies they put you through the wringer. On the other hand, when you get harassed for the fifth time in a row with a “click on the car” puzzle, it is reassuring to know that the world’s largest data miner doesn’t know who you are. That’s me, always living on the bright side of the dark side.

Indeed, this need is addressed in the LaTeX Non-Fiction template. I set that one up with two styles, one to mark words and phrases for direct indexing, and a second to specify index keys directly. That’s probably the only practical way of doing it, other than always typing in the key separately.

The updated attached project now supports both marking readable text for indexing, as well as typing in your own index key wherever you want. In the example we have two insects being indexed, but these keys will not otherwise appear in the output text, as the style they use wholly wraps the text into syntax rather than having any part left as normal text.

I’ve also added a strip code to the formats’ Replacements, to remove whitespace on either side of the key. As you can see in the example, it might be desirable to leave a little space around the key for readability.

Subindexing would require a little more digging; probably best done with some additional Replacement type features. For example if the Index Key style were written like “Insects:Roaches”, then a replacement could expand that out to whatever XML was necessary to express that. Probably not best for a self-documenting example in a stock Format though.
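
To make the “Insects:Roaches” idea concrete for the ODT side, here is a hedged Python sketch of such an expansion. It assumes the OpenDocument `text:alphabetical-index-mark` element takes the entry in `text:string-value` and parent levels in `text:key1` and `text:key2`; the function name `odt_index_mark` is invented, and none of this is taken from the stock Formats.

```python
from xml.sax.saxutils import quoteattr


def odt_index_mark(key):
    """Turn 'Parent:Child' into a raw ODT index mark with parent keys.

    Whitespace around each part is stripped. Up to two parent levels are
    allowed, which are assumed to map onto text:key1 and text:key2.
    """
    parts = [part.strip() for part in key.split(":") if part.strip()]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 colon-separated parts, got {key!r}")
    *parents, entry = parts
    attrs = [f"text:string-value={quoteattr(entry)}"]
    for level, parent in enumerate(parents, start=1):
        attrs.append(f"text:key{level}={quoteattr(parent)}")
    xml = "<text:alphabetical-index-mark " + " ".join(attrs) + "/>"
    # Same raw-inline wrapper as the Prefix/Suffix examples, ODT output only.
    return f"`{xml}`" + "{=odt}"
```

A key with no colon simply produces a top-level entry, so the same style could serve for both plain keys and subentries.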

It would also be possible to do section-level indexing, I believe, even using the Keywords feature if you wanted to. Consider putting the <$keywords> tag into a Section Layout’s “Prefix” tab, for example. That will generate a comma-delimited list of terms, which a combination of styles and replacements could convert into the necessary XML codes.
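
As a sketch of that last conversion, the following Python turns a comma-separated keyword list into one Word XE field per keyword. The function name `keywords_to_xe_fields` is hypothetical, and the input is assumed to be the plain text that the <$keywords> tag would produce for a section.

```python
from xml.sax.saxutils import escape


def keywords_to_xe_fields(keyword_list):
    """Emit one raw OpenXML XE field per keyword in a comma-separated list."""
    fields = []
    for keyword in keyword_list.split(","):
        keyword = keyword.strip()
        if not keyword:
            continue  # skip empty items such as a trailing comma
        fields.append(
            '`<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
            f'<w:r><w:instrText xml:space="preserve"> XE "{escape(keyword)}" </w:instrText></w:r>'
            '<w:r><w:fldChar w:fldCharType="end"/></w:r>`{=openxml}'
        )
    return "".join(fields)
```

Placed at the start of a section, these fields would index the page on which the section begins rather than each place a term is used.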

(The examples have been refined and are now included in the base installation of the software. Refer to the post below for instructions on how to use them.)

gr · October 23, 2018, 3:59am

I am pretty sure that there is no extra foo necessary to express that – so I am thinking the addition of the Index Key style will handle this option too.*

gr

Not where I can test this right now though.

derick · March 6, 2019, 9:16pm

Lovely post here on the virtues of manual indexing:

illuminationsmedia.co.uk/in … ing-about/

Badja · March 21, 2022, 3:37am

I am a newbie on Scrivener, but I find it amazing that it doesn’t have an inbuilt index function. It doesn’t have to be perfect as long as the index can be updated manually as a last step before releasing the document. A tool that does 90% of the work would be acceptable.

I bought Scrivener as it was recommended for larger pieces of work (Books, Theses, etc.), but it appears to me to be really useful only for works of fiction. As Bill Bryson once said, any non-fiction book without an index should go straight in the bin.

Badja · March 21, 2022, 3:40am

PS Thanks for the ODT work around, but it’s still a work around. Looking at your example now.

AntoniDol · March 21, 2022, 7:48am

InDesign creates multiple effective indexes for your non-fiction books when you lay out the text for Publishing.

Scrivener was built for writing, not for Publishing.