Case transformation regexes \u and \U do not work in compile

jandavid · July 10, 2018, 10:51pm

This was discussed already a while ago here (literatureandlatte.com/foru … se#p224631), and was recommended to be posted in the Bug Hunt forum but, alas, I never did. My apologies!

Yet, here we go again, new project, same problem …

In the compile replacement tabs, when using regexes, case transformations such as \u or \U do not seem to work.

How to reproduce:
Compile a document with some words in it
Enter (\w+?) in the replace column and \u$1 in the “With” column. Check RegEx.
Compile.
Every “word” is changed to “\uword” when it should be changed to “Word”.

Thank you very much for looking into that (or helping me figure out what I’m doing wrong).

jandavid · July 11, 2018, 7:58pm

In addition, I’ve now realized that in the with column of replacements \n or \r are also interpreted as a literal “n” and “r”.

AmberV · July 12, 2018, 12:42am

According to Apple’s documentation, it is ICU compliant. It’s very close to PCRE, but it doesn’t support any special backslash transformations or string replacements. To insert a whitespace such as tab or return, you have to insert the literal string (which you can do into fields with ⌥⇥ and ⌥↩). Backslash only works for backslashes and $.

jandavid · July 12, 2018, 5:04am

You’re right, the literal return works …
that’s a workaround for now. But what about case transformations?

I don’t have sufficient programming knowledge to fully understand what you write (assuming it’s an Apple problem given the links?), but is there any workaround for changing the case of the replacement string?

AmberV · July 12, 2018, 3:24pm

Yes, the regular expression engine is a framework provided by the Mac, it’s not something we can modify ourselves. The operators supported in the replacement pattern are pretty limited—table 3 in the documentation lists everything you can do. Namely: insert capture groups with $1, $2, etc., and insert the characters “$” and “\”.

As for workarounds, using the Processing compile option pane, you could make use of other regular expression engines to further manipulate the output in ways Replacements cannot. Beyond simple command-line approaches, that can dip into programming however.

jandavid · July 12, 2018, 6:38pm

I do postprocess on the command line with pandoc. To do that and the replacements together in one go via a script would indeed be elegant. Would you be able to help me implement this? I’m sorry, I’m not a programmer, my knowledge ends with regexes and what I find on the internet. …

I have different compile formats defined for different purposes all using pandoc. Postprocessing arguments look like this

<$inputfile> -f markdown-auto_identifiers -t latex --biblatex --columns=120  -o <$outputname>.tex

or

-t docx --bibliography=$HOME/.pandoc/Bibliography.bib -M reference-section-title=References -N --reference-doc=$HOME/.pandoc/templates/refdoc-num-headings.docx -o <$outputname>.docx

I also have one format defined with the filesplitter script that you had posted here (viewtopic.php?f=2&t=52114&p=267229&hilit=split+multiple+files#p267229).
which I pasted into the script field in Scrivener replacing the MultiMarkdown command that you had provided with this:

 `pandoc -f markdown-auto_identifiers -t latex --biblatex --top-level-division=chapter --columns=120 -o #{filename} #{tmpfile.path}`

Works flawlessly BTW, I’m always impressed when I use some code that I don’t understand half of it, and it does some magic for me

I assume I could do all of the postprocessing as different scripts, where I first execute the regexes I need and then process it via pandoc similar to this one. If you could help me with a script that I could modify, where I could add some regexes, that would be awesome (I can probably construct the regexes myself.)

Thank you very much!

AmberV · July 13, 2018, 6:22pm

Well one cool thing about shell scripts is that at their most basic they can be thought of as merely a sequence of individual commands that you’d input by hand into Terminal one after the other. There is of course much more that can be done with them, but if all you need to do is run sed or something first, and then pandoc to finish it off, you can put both lines into the “Script” field. So that’s one really easy way to automate or chain together several tools.

But for simple cases, it may be better to use pipes. I provide an example of this in the Processing pane documentation, bottom of page 670. This example takes MMD output and injects it into the clipboard instead of making a file when you compile. The principle can be applied to other things however, such as:

Path:

/usr/bin/sed

Argument

-E 's/replace/with/' <$inputfile> | pandoc ...

It’s a little quirky because you’re putting the first part of the command in one field and two commands in the second, as arguments to the first, but separating path from arguments is a bit of an artificial contrivance anyway. The result that is sent to the shell is “path arguments”, so as long as you recognise all of this will be ending up on the same line together, you can do most of the stuff you would do in a “one-liner” in Terminal.

Naturally you would need to modify the Pandoc command slightly to take standard input from the pipe, which will have the text that is modified by sed, instead of opening the original file. The output would remain the same, as you still want a file in the end, and you want Pandoc to create it.

In the case of the Ruby splitter script (glad to hear you’re getting good use out of it ), then that would be a decent place for the transformation, since we’re already processing the full text. Try something like the following. In the script, look for the line of code in the first line given below, and paste in the second line after it:

[code]…

next if chunk.length < 1
chunk.gsub!(/PATTERN/) { |match| match.capitalize }

…[/code]

Put your regular expression into the “PATTERN” spot, between the slashes, and see if that does what you’re looking for. A lot of that syntax is pretty magic and should be left alone, but that “match.capitalize” should be pretty straightforward, and you should know you can do other things there if you want. Capitalize will upcase the first character in the matched string (and downcase the rest), which I think is what you want. But if not, let me know; there really is no limit to what can be done to the matched string.

Oh and something worth mentioning is that in the example above, the whole string that is matched gets stored in the ‘match’ variable for processing, so there is no need to use parentheses in your pattern. “\w+” would suffice (greedy rather than “\w+?”, since a lazy quantifier with nothing after it matches only one character at a time).
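For anyone who wants to try the block form outside of the splitter script, here is a minimal sketch. The helper name `capitalize_matches` and the sample strings are made up for illustration; only `gsub` with a block and `capitalize` come from the code above.

```ruby
# Hypothetical helper: run the block form of gsub on any text and pattern.
def capitalize_matches(text, pattern)
  text.gsub(pattern) { |match| match.capitalize }
end

# Greedy \w+ matches whole words, so each word gets an initial capital.
puts capitalize_matches('some words here', /\w+/)   # Some Words Here

# Lazy \w+? with nothing after it matches one character at a time,
# so every single letter is "capitalized".
puts capitalize_matches('some words here', /\w+?/)  # SOME WORDS HERE

# capitalize also downcases everything after the first character.
puts capitalize_matches('mcDONALD', /\w+/)          # Mcdonald
```

If only the first letter should change and the rest be left alone, something like `match[0].upcase + match[1..]` inside the block is one alternative.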

jandavid · July 15, 2018, 1:29am

Thank you, this is immensely helpful.

I’ve played around with it a bit and I think I can get it to do everything I need, but I’d need some more help. I tried pasting two sed commands and then the pandoc line into the script field as you suggest, but I’m doing something wrong as I get a “$inputfile: ambiguous redirect” error message.

sed 's/\\(\w+?)\{\}/\\\u$1\{\}/g;'
sed 's/\\label{\S+?}//g;'
pandoc <$inputfile> -f markdown-auto_identifiers -t latex --biblatex --columns=120 --top-level-division=chapter -o <$outputname>.tex

Any suggestions?
I do like the replacements tab that Scrivener provides, as it gives a great overview over all that’s going on, so having it all in the script field would be preferable over pipes, which would make the whole thing too long (with the additional benefit, that I could place a comment after each regex to remind me of what they are doing). What would be the best way to do this? A string of simple sed commands one after the other as I’m trying above? Or would it make sense to turn it into an actual script? like perl? … but I’m not sure how to pass the arguments <$inputfile> <$outputname> to the script …(which also seems the problem above) …

Some of the regex patterns I need to reproduce would be:

s/\\(\w+?)\{\}/\\\u$1\{\}/g; # Capitalize leipzig glosses
s/\\label{\S+?}//g; # delete all LaTeX labels for Word export
s/\\ref{\S+?}//g; # delete all crossrefs
s/(\[\@\w+\:.+?\])\s\./$1\./g; # fix for extra space before period after citation.
s/^#\s+?(.+?)\s+?\{\.unnumbered\}/\:\:\: \{custom-style=\"Unnumbered Heading\" 1\}\n$1\n\:\:\:/g; # convert unnumbered section to custom word style
s/(^\\\w+?\[??.*?\]??\{.*?\}\s*?)\%+?\s*?(.*)/$1 /g; # convert LaTeX to HTML comment so that Pandoc ignores them (otherwise it escapes the % sign)

There’s probably a more elegant solution to this, but that’s all that I can do with my (and google’s) expertise. Could you help me properly frame this?

And then for the ruby script:

Capitalizing works, yes! But, as you suspect, the example is not as simple as I had put it in the post. I’d need to do some similar regexes like the ones above, where I exclude part of the pattern and add stuff to the matched string, or matching multiple capturing groups, like the example above putting a comment into HTML tags. And I don’t know how to translate the simple s/…/…/g syntax into ruby. …

What does seem to work is to place multiple
chunk.gsub!(/PATTERN/) { … }
chunk.gsub!(/PATTERN/) { … }
one after the other, so if I figure out how to write the correct patterns in ruby this can probably go a long way …

Thank you very much for your help!

nontroppo · July 15, 2018, 5:20am

I’m currently very short on time so can’t help with regexes, but one general point, the whole purpose of a tool like pandocomatic is that it provides a principled way to manage Pandoc and scripts to run… It allows you to run general setup/cleanup scripts, direct “pipe” scripts (pre and post processors that work on the raw character stream), and manage Pandoc filters (very cool functionality that works on the semantic chunks of Pandoc documents). You don’t need it, but it provides a more elegant way to combine all of these disparate elements into templates that are simply specified from Scrivener.

heerdebeer.org/Software/markdow … -templates

AmberV · July 16, 2018, 12:54pm

Before getting into the nitty gritty, I’d second the use of a prepared system like the excellent pandocomatic, as it’ll handle these details for you. But if you’d like to learn how this stuff is done, here are some tips.

Firstly, Scrivener’s placeholders won’t work inside of a script because the compiler isn’t going to go into a script and modify it (that would be too risky). So you’re sending <$inputfile> directly to the command line, where the shell reads “<” and “>” as redirection operators, which is what is confusing it (and that it is syntax is why we can’t just blindly change text inside of a script).

As you guessed, what you need to do is supply these values to the script somehow, and that is what the Arguments field is for. Something as simple as this should work:

<$inputfile> <$outputname>.tex

From within a shell script, you can refer to arguments by $1, $2 and so forth. Thus:

sed 'pattern' $1 > tmp
sed 'pattern' tmp > tmp2
pandoc -o $2 tmp2
rm tmp; rm tmp2

What this does is first run sed on the input file ($1), redirecting the results to a file called “tmp” in the compile folder. Next we run the second sed command using this output file as input, and redirect its result to a new file called “tmp2”. Then we run Pandoc with the output set to our designated compile name, which is stored in $2, using the second temp file as input. Lastly we clean up the two temp files (feel free to leave that last line off as you test, as it can give you valuable insight into the process if something isn’t working right). I’d stress this approach is better as an educational step rather than a solution.

With your script, you are running sed with no inputs or outputs, which is fine to do, but only if you intend to use pipes to move data around. You could for example dispense with the temp files with a single command like this:

sed 'pattern' $1 | sed 'pattern' | pandoc -o $2

The pipe character on the end means to send the output to the next command directly. Thus we do not need to provide input data save for with the first command. The middle command looks for standard input via the pipe and sends its results to standard output. The third command has a built-in output file call with “-o”, but no input, so it’s using the standard input from the second command.
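If you would rather keep the replacements in one commented list than in a long chain of sed calls, the same pipe idea can be sketched in Ruby. This is only an illustration under assumptions: the `RULES` table, the `apply_rules` helper and the file name `filter.rb` are invented here, and the two rules are placeholders borrowed from the patterns posted earlier.

```ruby
#!/usr/bin/env ruby
# Ordered list of [pattern, replacement, note]; the note is only a reminder.
# These two rules are placeholders, swap in your own.
RULES = [
  [/\\label\{\S+?\}/, '', 'delete LaTeX labels'],
  [/\\ref\{\S+?\}/,   '', 'delete cross-references']
].freeze

# Apply every rule in order and return the modified text.
def apply_rules(text, rules = RULES)
  rules.reduce(text) do |acc, (pattern, replacement, _note)|
    acc.gsub(pattern, replacement)
  end
end

# When run as `ruby filter.rb INPUT`, print the filtered text to standard
# output so that it can be piped on to the next command.
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  print apply_rules(File.read(ARGV[0]))
end
```

From a shell script this could then be used as `ruby filter.rb "$1" | pandoc -o "$2"`, since Pandoc reads standard input when no input file is given.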

⠂─────── ⟢⟡⟣ ─────── ⠂

Now on to the Ruby code, as you note you can just use this command sequentially if you need to perform multiple search and replace operations. That’s probably the easiest, from a not-having-to-learn-Ruby-programming perspective. If you find things get a bit slow in real-world usage, it’s the sort of thing that could be optimised.

As for translating s/replace/with/g to Ruby syntax, the form described above is an alternate approach that can be used when you need to do something more complicated than what regular expressions all by themselves can do. You’d want to use that for capitalisation, but for the things that are doing simple replacements, like deleting cross-refs, you can use a simpler syntax:

chunk.gsub!(/\\ref{\S+?}/, '')

Ruby documentation can be accessed on the command-line with the “ri” command. So to look up documentation on the gsub method, you can type in “ri gsub”. You’ll find usage examples in there, for doing replacements that insert stored call-backs. Ruby uses the \1, \2 format instead of $1, $2, by the way.

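Here is the one that removes the extra space between a citation and the period that follows it (note that the period does not need a backslash in the replacement string):

chunk.gsub!(/(\[\@\w+\:.+?\])\s\./, '\1.')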

Hopefully it should be pretty straight-forward!
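To check the two string-replacement forms outside of the splitter script, a small sketch like the following can be run on its own. The method name `clean` and the sample sentence are made up for illustration, and the braces in the first pattern are escaped here just to be on the safe side.

```ruby
# Hypothetical helper combining the two simple replacements.
def clean(text)
  # String form: the second argument is the replacement; '' deletes the match.
  without_refs = text.gsub(/\\ref\{\S+?\}/, '')

  # Backreference form: '\1' re-inserts the first capture group,
  # so only the stray space before the period is dropped.
  without_refs.gsub(/(\[@\w+:.+?\])\s\./, '\1.')
end

puts clean('See [@smith:2018] . Compare \ref{fig:one}Figure 1.')
# See [@smith:2018]. Compare Figure 1.
```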

jandavid · July 17, 2018, 12:55am

Thank you very much, Amber! This is very comprehensive and, I’m learning quite a bit. I’m still running into a few problems though:

Sed:

I’ve followed your instructions and to test put

sed 's/\\(\S+?)\{\}/\\\u$1\{\}/g;' $1 > tmp
pandoc -f markdown-auto_identifiers -t latex --biblatex --columns=120 -o $2 tmp

into the script field while providing <$inputfile> <$outputname>.tex as Arguments.

However, I get the following error, which I assume, is because I am using $1 to refer back to my capturing group while the same placeholder is used to refer to the input.

I tried with \1 but with the same problem. A simple replacement, without capturing group and backreference works fine.

⠂─────── ⟢⟡⟣ ─────── ⠂

Ruby:

I managed to get ruby to work … well, mostly.

BTW, just for the sake of completeness, some context to this (as you might wonder what the point of such strange macros is): I’m a linguist, and we often use so-called “functional glosses”, i.e., words/parts of words that have some grammatical function, such as PST, meaning “past”, or SG meaning “singular”. In the text they usually appear as small-caps. Now there is a LaTeX package, called “leipzig” (after the Leipzig Glossing Rules) that has the most common glosses predefined as macros (and makes it fairly easy to define your own). The package not only prints them in the right form in small-caps, but also works together with the glossaries package to create glossaries. The leipzig macros are mostly (with a few exceptions) simply the name of the gloss but with the peculiarity that they start with a capital letter (and are usually followed by {} to ensure that a potential space after the gloss is not eaten up). So to get the small-caps PST gloss in LaTeX I would type \Pst{}, SG is produced by \Sg{}, etc.

So now with Scrivener 3 I’ve defined a character style for glosses, which (a) in Scrivener has small-caps activated so it looks as it should and (b) when compiling for Word the style gets a prefix “[” and a suffix “]{.smallcaps}”. Now (c), when compiling for LaTeX it needs a prefix “\” and a suffix “{}”, and initially I wanted to supply these directly in compile. But since I also need to capitalize, and therefore need to postprocess anyway, I figured it’s probably better (to avoid false positives) to give my Scrivener style a more unique prefix and suffix so other potential \LaTeX{} macros don’t get messed with. So now my gloss prefix is <gl> and my suffix </gl>, hence I need something that turns <gl>word</gl> into \Word{}. After some more googling, I finally came up with this, which works:

  chunk.gsub!(/<gl>(\S+?)<\/gl>/) { |match| '\\' + $1.capitalize + '{}' }

(not sure it’s the right way to do it, but it does the job).
Yay, I wrote my first ruby replacement pattern …

Out of idle curiosity:

I’ve experimented a bit and the one thing I didn’t figure out was how to add \n newlines to the replacement pattern. For example this replacement. I thought either

  chunk.gsub!(/^\#\s+?(.+?)\s+?\{\.unnumbered\}/, '::: {custom-style="Unnumbered Heading 1}:::\n\1\n:::')

or

  chunk.gsub!(/^\#\s+?(.+?)\s+?\{\.unnumbered\}/) { |match| '::: {custom-style="Unnumbered Heading 1}' + \n$1\n + ':::' }

would do the trick, but … they don’t.
This particular example is actually unnecessary, since pandoc passes unnumbered headers quite well to latex on its own … I’ve constructed something like that for my other workflow to Word, but as I was experimenting with the ruby replacements a bit I wanted to see what it all can do and got stuck there. Which brings me to …

⠂─────── ⟢⟡⟣ ─────── ⠂

Pandocomatic:

Yes, I fully agree, this would very likely be a much better and much more elegant solution … in fact, when I started to set up my first Scrivener 3 project a few weeks ago, I downloaded pandocomatic and also scrivomatic, and played around with that workflow a bit. Yet, I must admit, it was a bit overwhelming, so I ended up abandoning it again. I will likely never need the vast amount of options, and I thought if I can get it all done with styles and a few regexes after, I’ll be fine. Also, at least I’ll know more or less what I’m doing. So it felt like it would be overkill for my needs, as well as make it harder for me to troubleshoot and tweak to my needs. Please don’t get me wrong, this is not intended as a criticism of those tools, or their documentation, but rather a criticism of my own limited skills

However, then the problem with Scrivener (or rather Apple) not being able to capitalize a word in the regex replacement appeared … and … here we are, and now I have to tweak a ruby script and all kinds of other nasty things where I’m equally lost how to troubleshoot …
So, in the end, maybe I should have stuck with pandocomatic and invested the time necessary to learn all that. I leave that to the experts to judge. Maybe over the weekend I’ll give it another shot …

In any case, I feel like I’m now so close to getting it working, hopefully I can figure out the few remaining problems. Thank you very much for your help!

jandavid · July 19, 2018, 2:31am

I do need to take advantage of your generous help once more concerning the ruby script.

So, I was able to add all replacements I needed with the gsub method, but there is one that is a bit more complex pattern where I’m not making any progress, even with google and online help. Using Scrivener’s style to add <gl> and </gl> to my “glosses” doesn’t yet take into account possible sequences of them, such as PST-NMLZ=LOC, which end up as
<gl>pst-nmlz=loc</gl>
where I need them to be
<gl>pst</gl>-<gl>nmlz</gl>=<gl>loc</gl>
essentially making sure that the tags enclose every word and not just the entire span. The glosses themselves are only word characters, usually separated by -,= or a space (i.e., non-word characters).

In perl, I can do this:

s{(<gl>) (.+?) (</gl>)} {($o,$t,$c)=($1,$2,$3);$t=~s/(\w+)/$o$1$c/g; $t}gex;
which does exactly what I need.
Is there a way to translate that to the ruby gsub syntax so I would be able to include it in the other script as well?
Thank you very much!

AmberV · July 19, 2018, 1:30pm

Oops, sorry, I thought I sent this off yesterday but it looks like I left the page in draft mode.

Thanks for the background! That’s all very interesting and I’ll make note of the specific use cases for anyone that might be on the same path in the future.

Case transformation is something that I use for the Scrivener user manual as well—I do something very similar with Ruby to get uppercase abbreviations like “RTF” and “PDF” into lowercase small-caps. I find it more expedient to just be able to type those kinds of things in naturally, without a special style, and handle them in post. For one thing it means I can detect when such a word is in a particular context, like within a Markdown heading block, and suppress the behaviour as it otherwise causes odd looking ToC entries, like “Exporting to pdf”, where small caps can’t be used.

With sed you are better off with \1 notation for backreferences. It looks like it might take dollar sign notation as well, but within a script you might need to backslash the dollar sign.

As for the error itself, there are two issues here. One is that sed doesn’t use \S, and sed also does not support lazy quantifiers. You’ll want to find another way to ensure the pattern doesn’t over match, but here is a way to search for one or more not-whitespace characters, using the variant of sed available on macOS:

\\([^[:space:]]+)\{\}

Since it seems you have familiarity with Perl-style regex syntax, it might not be a bad idea to switch to using Perl oneliners for this stuff? Your call—but I myself use sed for basic stuff, but if I run into problems I’ll switch over to Ruby or Perl instead. They are more resource intensive, but I rarely need something that is optimised to the gills. (In fact, from the update in the other thread it looks like you already have done just that!)

Okay, with the Ruby problem, you’re close! Some of your strings are not quoted though. Make use of double-quotes if you intend to use special characters like “\n” in the string. Here is the approach I would take, which is a little cleaner I think:

"::: {custom-style=\"Unnumbered Heading 1}\n" + $1 + "\n:::"

Note I’ve double-quoted the prefix and suffix so that I can move the carriage returns into them and left the $1 variable alone and unquoted (as it is already a string). Do note that if you double-quote you’ll need to be more careful of what might be special characters within the string—such as that double-quote in the string itself, which I’ve escaped (and by the way, shouldn’t there be another double-quote literal in that example, after “Heading 1”?).

Now for simple search and replacements like the above, you don’t need to use the curly bracket syntax (it looks like you might have discovered that already). That was necessary in our other example, where we needed to basically execute a very small and simple “program” on the result, to take the matched word, store it in a String type variable, and then call a method on that string that capitalises it.

If what you are doing is a simple replacement, then you can supply the replacement string as a second argument in the parentheses. It would probably be clearer to use a very basic example, than to put your example into it:

chunk.gsub!(/(\w+)ly/, '\1est' + "\n")

That is a lot closer to the “s/pattern/replacement/g” form you’re looking for. By the way do note that Ruby uses \1 format for referencing, and those must be in single-quote strings, while you need double-quotes for newlines. So your above example could be supplied as the second argument like so:

chunk.gsub!(/PATTERN/, "::: {custom-style=\"Unnumbered Heading 1}\n" + '\1' + "\n:::")

(Maybe someone will correct me on that score. I’ve been using Ruby casually for almost twenty years, and some of my habits are at times old fashioned.)
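A short sketch of the quoting rules discussed above, which can be run on its own. The method names are made up for illustration, and it assumes the closing double-quote after “Heading 1” is wanted in the output. It also shows that a double-quoted string can carry the backreference if the backslash is doubled.

```ruby
# Single quotes keep backslash-n as two characters; double quotes make a newline.
puts '\n'.length  # 2
puts "\n".length  # 1

# Block form: the capture is interpolated into a double-quoted string.
def unnumbered_to_div(text)
  text.gsub(/^#\s+(.+?)\s+\{\.unnumbered\}/) do
    "::: {custom-style=\"Unnumbered Heading 1\"}\n#{Regexp.last_match(1)}\n:::"
  end
end

# String form: in double quotes the backreference is written as \\1.
def unnumbered_to_div_string_form(text)
  text.gsub(/^#\s+(.+?)\s+\{\.unnumbered\}/,
            "::: {custom-style=\"Unnumbered Heading 1\"}\n\\1\n:::")
end

puts unnumbered_to_div('# Preface {.unnumbered}')
```

Both methods should give the same three-line result, so which one to use is mostly a matter of taste.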

And to follow up on your last message as well, you could approach that problem this way:

chunk.gsub!(/<gl>(.+?)<\/gl>/) { $1.gsub(/([^-=\s]+)/, '<gl>\1</gl>') }

So here we’re first looking for, and storing into $1, any block of text found within the gl element, passing that along to a block, and then converting all sequences of characters other than =, - or spaces to have a gl element wrapped around them. In case it isn’t obvious, both the chunk and $1 are string variables, and strings in Ruby have the .gsub method available to them. In this way we can call gsub on the previously matched string from a regular expression.

The .gsub! method is shorthand for “chunk = chunk.gsub”. That’s why we don’t use that format inside the block on $1, as we don’t really need to transform $1 in place, since the result of that method is the modified string, passed back to the original .gsub! call.
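Putting the nested call together with the earlier capitalising step, a self-contained sketch might look like this. The method names `wrap_each_gloss` and `to_leipzig` are invented for the example; the patterns are the ones from this thread.

```ruby
# Step 1: re-wrap every gloss inside a <gl>…</gl> span individually.
def wrap_each_gloss(text)
  text.gsub(/<gl>(.+?)<\/gl>/) do
    # $1 is the span content; the inner gsub returns a new string,
    # which becomes the replacement for the outer match.
    $1.gsub(/([^-=\s]+)/, '<gl>\1</gl>')
  end
end

# Step 2: turn each <gl>word</gl> into the leipzig macro \Word{}.
def to_leipzig(text)
  text.gsub(/<gl>(\S+?)<\/gl>/) { '\\' + $1.capitalize + '{}' }
end

wrapped = wrap_each_gloss('<gl>pst-nmlz=loc</gl>')
puts wrapped              # <gl>pst</gl>-<gl>nmlz</gl>=<gl>loc</gl>
puts to_leipzig(wrapped)  # \Pst{}-\Nmlz{}=\Loc{}
```

The order matters: the wrapping step has to run before the macro conversion, otherwise the whole span would be treated as a single gloss.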

jandavid · July 19, 2018, 4:45pm

This is really great! Thank you so very much indeed for your thorough explanation and help. Project fully set up and all working smoothly. I wish I knew a bit more of programming … always amazed at all that is possible, in usually fairly simple/straightforward ways.

So yes, I did figure out that perl would do what sed didn’t, and, yes, perl is the one language that I’m a tiny bit familiar with. So the code I posted in the other thread (again here for posterity)

#!/bin/sh
perl -pe '#
s/replace/me/g;
s/replace/me/g;
' $1 | pandoc .... and whatever else one might need ...
does most of what I need.

About ruby, I really appreciate your detailed explanation, and I’ve learned a bunch. Your code for wrapping the individual words with tags works perfectly … and might actually be useful in other scenarios where one could use a style with a prefix and suffix and then upon export might need to apply these to single items within it. Instead of the “not -,=,space” expression [^-=\s]+ one could also use \w+ in this case (as glosses should only have word characters).

chunk.gsub!(/<gl>(.+?)<\/gl>/) { $1.gsub(/(\w+)/, '<gl>\1</gl>') }

So these all are great resources - it seems I’m well equipped now to deal with future problems
Thank you so much for this amazing support!

AmberV · July 19, 2018, 6:23pm

You’re quite welcome! Glad to hear all of the wrinkles are smoothed out, and that you’ve got what you need to implement adjustments going forward.

Learning a little scripting is a good investment of time in my opinion, for those inclined. Even just a little bit can go a long way. If you’re interested in Ruby, a couple of good books are Programming Ruby, by Dave Thomas. It has a good tutorial and is otherwise an invaluable reference. The Well-Grounded Rubyist, by David Black, is aimed at beginners, with a good focus on the elements and philosophy.

That’s a good idea, and it can be taken far. If you look elsewhere in my splitter script, you’ll see “do” and “end” calls—that’s the long form for what we’ve been doing here. One could just as well use this:

input.gsub!(/PREFIX(.+?)SUFFIX/) do |match|
  ...
end

Where “…” can be however many lines of code you need to do a thing. One could perhaps create a paragraph style that operates as a code block, where the prefix and suffix that are added by Scrivener are removed and the whole thing is replaced by a syntax-highlighted alternative with line numbering.
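As a rough sketch of that idea, assuming a made-up prefix and suffix of `<code>` and `</code>` and a plain `~~~` fence as the output (none of which comes from an actual compile format), the long form could look like this:

```ruby
# Hypothetical example: replace each <code>…</code> span with a fenced,
# line-numbered block. Prefix, suffix and fence style are assumptions.
def number_code_blocks(input)
  input.gsub(/<code>(.+?)<\/code>/m) do
    # Split the captured text into lines, dropping surrounding blank space.
    lines = Regexp.last_match(1).strip.lines.map(&:chomp)

    # Prefix every line with a right-aligned line number.
    numbered = lines.each_with_index.map do |line, index|
      format('%2d  %s', index + 1, line)
    end

    # The last expression in the block is the replacement text.
    "~~~\n" + numbered.join("\n") + "\n~~~"
  end
end

puts number_code_blocks("<code>\na = 1\nb = 2\n</code>")
```

The `m` flag lets the dot match newlines, which is needed as soon as the styled text spans more than one line.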

aglazner · October 20, 2023, 2:12pm

I have a manuscript with numerous definitions of the sort “Word: definition of word.” and I need to capitalize the words following the colon-space (e.g., “Definition”). I know little RegEx, but the following should find these and replace them with upper case:

find: :\s*([a-z])
replace: : \U$1

This works in some flavors of RegEx on regex101, but not all. It does not work in Scrivener, and I’ve tried several variations. Advice appreciated.

xiamenese · October 20, 2023, 2:31pm

There was a thread about this quite some time ago; unfortunately, I can’t find it at the moment.

The problem is that the flavour of RegEx that Apple uses, ICU, doesn’t recognise the ‘L’ and ‘U’ flags for upper and lower case in the replace line. At the time, I also tried it in Nisus Writer Pro, which did, but it turned out that NWP uses the “Oniguruma” flavour of RegEx, not ICU.

I read recently that Apple is changing their flavour to something else, but when, and whether it will apply to TextKit 1 is anybody’s guess.

Unfortunately RegEx101 is more useful for Windows users, as it doesn’t include ICU, so what will work there, won’t necessarily work on MacOS.

Mark

AmberV · October 23, 2023, 6:57pm

Note there is some discussion above, that may be of use to you if the query is about compiled output rather than changing the source text, and you are using Scrivener as a platform for generating or making use of plain-text workflows such as Markdown, LaTeX and so forth. Compile Formats for plain-text outputs have a Processing pane available to them, as you may be aware, and I show how one could rope sed into the equation if desired, or any scripting language that supports this and other things.
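For the specific “Word: definition” case, a post-processing step along those lines could be sketched in Ruby as follows. The method name is made up, the find pattern is the one from the question, and like the original replace string it normalises the gap after the colon to a single space.

```ruby
# Hypothetical helper: upcase the first lowercase letter after a colon.
def capitalize_after_colon(text)
  text.gsub(/:\s*([a-z])/) { ": #{Regexp.last_match(1).upcase}" }
end

puts capitalize_after_colon('Word: definition of word.')
# Word: Definition of word.
```

Note that this touches every colon followed by a lowercase letter, so a narrower pattern may be needed if the manuscript uses colons elsewhere.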

That said, if you are using Markdown-esque approaches, the best solution to changing the format of anything (which I would include letter case under, for something like this) is with stylesheets, not changing the actual underlying text to force it.

HTML-based example...

Word
: Definition of the word

Becomes:

<dl>
  <dt>Word</dt>
  <dd>Definition of the word</dd>
</dl>

Which we can make look however we want, but specifically:

dt { text-transform: uppercase; }

There are of course similar capabilities in most other systems, not just HTML