Regex help

I’m trying to use a regex search of documents in Script Mode to find scenes where two characters speak.
(Like LUKE and HAN)
I know the standard Scrivener search doesn’t support boolean AND;
I also know that you can iterate searches by searching ALL and then searching SELECTED from results.

Ideally, I’d like to do this with a single regex search so I can automate it.

This works to find one name:

(?=.*LUKE\n)

However, this does not work:

(?=.*LUKE\n)(?=.*HAN\n)  

I’ve tried other lookaheads, but nothing seems to work.

I’d appreciate any illumination. Thanks.

So you want to search for a text like: LUKE and HAN\n

You don’t need anything as fancy as lookahead or lookbehind. Your regex is failing for several reasons: you use only lookaheads with no characters between them, where you would need lookbehind + pattern + lookahead; and you include \n twice even though it would only occur once…

Anyway here is an example that should find this:

regex101.com/r/2f5FKh/2

That online editor is invaluable: it tells you how a regex matches, reminds you of the syntax, and much more. It uses the PCRE engine, which should be the same as Scrivener’s, though you’ll need to confirm it works. I use ^ and $, which anchor to the start and end of a line, but I forget how Scrivener sets its regex globals (I assume it must be line-based, not string-based)…

Thanks very much for your reply.

Some clarification:

I’m searching for documents in the binder which contain both

"LUKE\n"

and

"HAN\n"

NOT

"LUKE AND HAN\n"

How would I properly form a regex search for that in Scrivener?

OK, so we need to treat newlines as if they were just another character (so that .+ matches newlines as well as any other character). This can be done with the /s modifier, but I don’t know how to pass modifiers to Scrivener’s regex engine. Maybe @AmberV has an idea for this?
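For anyone who wants to see what that modifier changes, here is a minimal Python sketch. It uses Python's re module, not Scrivener's ICU engine, so treat it as an illustration only. It runs the two-lookahead pattern from the first post with and without "dot matches newline"; the sample scenes are invented. ICU's documentation lists an inline (?s) flag, but whether Scrivener's search field honors it is not confirmed anywhere in this thread.

```python
import re

# The two-lookahead pattern from the first post.
PATTERN = r"(?=.*LUKE\n)(?=.*HAN\n)"

# Invented sample scenes (assumptions for illustration only).
luke_then_han = "LUKE\nLet's go.\nHAN\nAfter you.\n"
han_then_luke = "HAN\nHurry up.\nLUKE\nComing.\n"
luke_only = "LUKE\nHello?\nLUKE\nAnyone?\n"

# Default mode: '.' stops at a newline, so no single position can see both names.
default_hit = re.search(PATTERN, luke_then_han)

# DOTALL is Python's name for the /s modifier: '.' also matches newlines.
dotall_hits = [
    re.search(PATTERN, scene, flags=re.DOTALL) is not None
    for scene in (luke_then_han, han_then_luke, luke_only)
]

# The same switch written inline at the start of the pattern.
inline_hit = re.search("(?s)" + PATTERN, han_then_luke)

print(default_hit is not None, dotall_hits, inline_hit is not None)
```

With the modifier on, the order of the two names no longer matters, because both lookaheads scan forward from the same position.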

We can easily work round that however:

regex101.com/r/ZxtONP/2

This matches LUKE\n or HAN\n across newlines; it uses [\s\S]+ to match anything, including newlines. It may produce false matches, however, because it is not anchored to the start of a line: a character name that falls immediately before \n (i.e. the name LUKE at the end of a line with no full stop after it) will also match. It uses a named capture group, which we can repeat using \g'name'.

EDIT: OK, it seems Scrivener’s regex engine doesn’t understand repeating a previous capture group, so we need to make it explicit:

regex101.com/r/ZxtONP/3

Looks like: (LUKE\n|HAN\n)[\s\S]+(LUKE\n|HAN\n)

You could then add a positive lookbehind to check that a \n also precedes the character name, but that would fail if the name is at the start of a document. You would then need another conditional, which makes it more complex, so accepting that you always need to finish your sentences with a full stop if they end in a character name seems a reasonable workaround…
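To make that trade-off concrete, here is a small Python sketch (Python's re module rather than Scrivener's engine, with invented sample text). It compares the bare pattern, the lookbehind version, and a multiline ^ anchor. The ^ variant is an addition for comparison, not something tested in this thread; ICU documents a (?m) flag, but whether Scrivener's search honors it is an assumption.

```python
import re

patterns = {
    "bare": re.compile(r"LUKE\n"),
    # Requires a newline before the name, so it misses a name that opens the document.
    "lookbehind": re.compile(r"(?<=\n)LUKE\n"),
    # With MULTILINE, ^ matches at the start of the text and after every newline.
    "anchored": re.compile(r"^LUKE\n", re.MULTILINE),
}

# Invented samples: a cue on the very first line, and a name that merely ends a line.
starts_doc = "LUKE\nHello.\n"
mentioned = "OBI-WAN\nI will train LUKE\n"

# For each pattern: (matches the opening cue, matches the end-of-line mention)
results = {
    name: (p.search(starts_doc) is not None, p.search(mentioned) is not None)
    for name, p in patterns.items()
}

for name, outcome in results.items():
    print(name, outcome)
```

The bare pattern is the only one that matches the end-of-line mention, and the lookbehind is the only one that misses the cue on the first line.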

EDIT: Ah, this may also match dialogue that contains only LUKE\n … LUKE\n. We need an OR over a bigger group, so that either LUKE or HAN comes first and the other comes later:

regex101.com/r/ZxtONP/4

Looks like: (LUKE\n)[\s\S]+(HAN\n)|(HAN\n)[\s\S]+(LUKE\n)
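For anyone who wants to check the difference between the last two patterns outside Scrivener, here is a rough Python sketch. It uses Python's re module, which is not the engine Scrivener uses, and the three sample scenes are invented for the test.

```python
import re

# Version 3: either name, anything in between, either name again.
loose = re.compile(r"(LUKE\n|HAN\n)[\s\S]+(LUKE\n|HAN\n)")
# Version 4: LUKE then HAN, or HAN then LUKE.
strict = re.compile(r"(LUKE\n)[\s\S]+(HAN\n)|(HAN\n)[\s\S]+(LUKE\n)")

# Invented sample scenes (assumptions for illustration only).
scenes = {
    "luke_then_han": "LUKE\nLet's go.\nHAN\nAfter you.\n",
    "han_then_luke": "HAN\nHurry up.\nLUKE\nComing.\n",
    "luke_only": "LUKE\nHello?\nLUKE\nAnyone?\n",
}

loose_hits = {name: loose.search(text) is not None for name, text in scenes.items()}
strict_hits = {name: strict.search(text) is not None for name, text in scenes.items()}

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

The loose pattern accepts the scene where only LUKE speaks; the strict one rejects it and still accepts both speaker orders.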

It works!

Thank you very much for both solving the problem, and educating me in the process.

Glad it finally worked! Regex as a tool is really useful, but I don’t need to use them often enough that I retain good working memory, so every once in a while it is good for me to do a “regex workout” :laughing:

@nontroppo
Thank you for sharing all this info! Could I ask about the final solve, pls?
Since my situation is a bit different to OP’s, I tried to strip down the formula and while
(LUKE)[\s\S]+(HAN)|(HAN)[\s\S]+(LUKE) works (searching Text using RegEx in Scrivener), this version does not:
(LUKE)+(HAN)|(HAN)+(LUKE).
I understand what \n is for, but \s\S seem to have multiple uses and none of them make sense to me in this particular case. Could you hint at what it does in this particular case, pls?
Many thanks in advance!
:slight_smile:
K

While you’re waiting for a more learned reply, have you looked at this site:

it can be really helpful in figuring out your regex problems. Good luck.

Thank you for your time, @popcornflix . It’s a nice site and I’ve found other info too, but as I stated above, it doesn’t make sense to me. That’s why I hope for @nontroppo 's hint/help/advice.
:slight_smile:
K

Well, in this particular case the [\s\S]+ construct was to work around the limitation of matching across lines. \s matches whitespace, and \S matches everything that is not whitespace. If you try to replace [\s\S]+ with the generic “match everything one or more times” .+ then the regex fails because . does not span across lines, and so the later pattern can never be matched.
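The three variants can be run side by side to see this. This is a minimal sketch in Python's re module (not Scrivener's engine), and the sample text is invented. It also shows why the stripped-down (LUKE)+(HAN) fails: the + there repeats the LUKE group itself, so HAN has to follow LUKE directly.

```python
import re

# Invented sample: the two names sit on different lines.
text = "LUKE walks in.\nHAN looks up.\n"

# '+' applies to the group before it: one or more LUKE, then HAN immediately after.
adjacent = re.search(r"(LUKE)+(HAN)", text)

# '.+' allows a gap, but '.' does not cross the newline between the two lines.
same_line = re.search(r"(LUKE).+(HAN)", text)

# '[\s\S]+' is "whitespace or non-whitespace", i.e. any character, newlines included.
any_gap = re.search(r"(LUKE)[\s\S]+(HAN)", text)

# What the stripped-down pattern does match: the names run together.
repeated = re.search(r"(LUKE)+(HAN)", "LUKELUKEHAN")

print(adjacent, same_line, any_gap is not None, repeated is not None)
```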

I think the more important question is what do you want to achieve? Normally it takes me a few passes to get to an optimal regex, as you can see in the post above where I start out with a simple regex that worked in the regex editor but failed in Scrivener then had to iterate till I got a working solution. Different regex implementations have different limitations, and it depends on both the problem posed and the regex engine used to find the solution…

Thank you so much, @nontroppo , for explaining this to me! I’ve found the same definition for example here (Regex Cheat Sheet), but it just didn’t make sense to me. Now it does!

re: I think the more important question is what do you want to achieve?
Basically the same thing as the OP only w/o the script format. I will be able to go from here now, I believe :smiley_cat: Want to learn what works in Scrivener too :smiley_cat:

re: Different regex implementations have different limitations, and it depends on both the problem posed and the regex engine used to find the solution…
I realized that, sigh. Not a good thing, but what can we do :-/ I’ve encountered a similar problem with Python. Not all Python interpreters are equal :crying_cat_face:

Best regards!
:slight_smile:
K

The good thing about regex101.com is that it does allow you to change among several regex engines. Other nice sites for visualising and testing regex are https://regex-vis.com/ and https://regexr.com

For completeness here for others who may not know what engine Scrivener uses, from §11.7 of the user manual:

Scrivener uses the stock RegEx engine supplied by the Mac, which uses the UTF-8 compatible ICU guidelines. ICU is mostly compatible with PCRE, which is considered to be the standard for extended regular expression syntax.

The link in the docs is outdated (@amberV needs an update). The current documentation for ICU is here: Regular Expressions | ICU Documentation, though this is aimed more at implementors than at end users. Wikipedia’s Comparison of regular expression engines page shows ICU as most similar to Java, so pick the Java flavor if you want to use regex101.

The best deep-dive online docs for regex IMO are from JGSoft: Regular Expressions Reference Table of Contents, and rexegg, which you already linked to…

Thanks! I’ll take a look and see about updating the link, and look for something maybe more aimed at users (though for such a specific thing there may not be anything).