Implementing a NEAR search (working regex, thanks GPT-4)

Julian_M1 · May 3, 2023, 9:02am

It would be very handy if there were a NEAR search (e.g. to find where two characters were speaking to each other, as their names are likely to be close together).

Since there is no such option, and I expected it to be non-trivial as a regex, I asked GPT-4 for the answer. It works nicely, so nicely that it would seem very easy to build in as an additional search type.

PROMPT

Please provide a PCRE compliant REGEX to perform a case insensitive NEAR search for two words (using placeholders XXX and YYY for the target words) in any order, up to n words apart.

ChatGPT

A near search for two words in any order up to n words apart is a fairly complex regex problem. The pattern can be built from word boundaries, a non-capturing group, and a bounded quantifier. Checking both word orders ensures that the words can appear in any order, and the quantifier limits the gap to at most n words in between them.

To make the search case-insensitive, you can use the i flag at the end of the regex. [Julian: Search seems to be case insensitive anyway. No idea how to force case sensitivity]

Unfortunately, PCRE does not support variable-length lookbehinds, so a single lookaround-based regex that matches the two words in any order with a variable number of words between them is awkward to write. The simpler route is two separate regex expressions, one for each possible order of the words; they can be joined with | if a single expression is required.

This is what those regex expressions would look like:

For XXX followed by YYY:
\bXXX\b(?:\W+\b\w+\b){0,n}\W+\bYYY\b
And for YYY followed by XXX:
\bYYY\b(?:\W+\b\w+\b){0,n}\W+\bXXX\b
You would replace n with the maximum number of words that you want to allow between XXX and YYY. For example, to allow up to 3 words in between, you would use {0,3}.

In these regex expressions:

\b is a word boundary. This ensures that the words match exactly and not as part of larger words.

\W+ matches one or more non-word characters (generally punctuation and whitespace).

(?:\W+\b\w+\b){0,n} is a non-capturing group that matches up to n intervening words, each preceded by one or more non-word characters. The \w+ within this group matches one or more word characters.

The i flag (written after the closing delimiter, as in /pattern/i, or inline as (?i) at the start of the pattern) makes the regex case-insensitive.

Note that you would need to use both regex expressions and check for a match with either one to find XXX and YYY in any order. The expressions themselves use no lookbehinds, only \b, \w, \W, a non-capturing group and a bounded quantifier, so they should behave the same in most Perl-style regex engines. What counts as a word character (ASCII only or full Unicode) does vary by engine and settings, so accented names may need checking.
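For anyone who wants to try the two expressions outside the app, here is a minimal Python sketch that joins them with | into one case-insensitive pattern. The function name near_pattern and its parameters are made up for illustration, and Python's re module is not PCRE, although it handles this particular syntax the same way.

```python
import re


def near_pattern(word_a: str, word_b: str, max_gap: int) -> "re.Pattern[str]":
    """Compile a NEAR pattern: word_a and word_b in either order,
    with at most max_gap other words between them.

    near_pattern is a hypothetical helper, not part of any library.
    """
    a = re.escape(word_a)  # escape in case a search term contains regex syntax
    b = re.escape(word_b)
    # Same gap expression as in the thread: up to max_gap intervening words,
    # then the separator before the second target word.
    gap = rf"(?:\W+\b\w+\b){{0,{max_gap}}}\W+"
    both_orders = rf"\b{a}\b{gap}\b{b}\b|\b{b}\b{gap}\b{a}\b"
    return re.compile(both_orders, re.IGNORECASE)


if __name__ == "__main__":
    pattern = near_pattern("alice", "bob", 3)
    for sample in ("Alice smiled warmly at Bob.", "Bob waved at Alice."):
        match = pattern.search(sample)
        print(sample, "->", match.group(0) if match else None)
```

Joining the two orders with | keeps it to a single search, at the cost of a longer expression; the matched text still starts at whichever name comes first.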

Vincent_Vincent · May 3, 2023, 12:15pm

You could search for [any word] and handle the proximity by eye.
Or use dialog focus.
(For now, as an alternative.)

Julian_M1 · May 3, 2023, 12:20pm

No, I couldn’t.

Dialog focus was a feature I was unaware of; thanks for drawing my attention to it.

GoalieDad · May 3, 2023, 12:29pm

A better way is to search text with the All Words option, then enter the first word (XXX) and the second word (YYY) separated by a space.
The advantage is that it will pull up the same files and scenes as your regex search, BUT it will also highlight those terms in the file, making them easy to pick out.

Vincent_Vincent · May 3, 2023, 12:37pm

My bad. I meant [All words], which would filter documents containing both, and then give what you see in @GoalieDad’s screenshot.

. . . . . . .

https://www.regular-expressions.info/near.html

reepicheep · May 3, 2023, 3:00pm

The only time I wanted to do that style of search, I compiled the project to PDF, then created a quick and dirty corpus with the various corpus linguistics software available (all of it open source or free to use) and looked for the pattern. CWB and LancsBox are my preference as they handle PDF. AntConc would need the output to go through some sort of ASCIIfication.

Vincent_Vincent · May 3, 2023, 3:06pm

Sounds complicated.
What extra benefits would you get compared to the proposed in-app method?

reepicheep · May 3, 2023, 3:59pm

You could have word stems, and a bunch of other features. Here’s a summary for LancsBox (#LancsBox: Lancaster University corpus toolbox), but there is much more.

Julian_M1 · May 3, 2023, 4:48pm

That’s interesting… I was thinking about doing it myself in Python. I’ve already got word counting, character mention counting and was actively considering coreferencing (until I hit too many problems with neuralcoref). Packages like spaCy and other neural NLP look handy but if someone’s already done the hard work, so much the better! I’m off to check those out, thanks!
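For the do-it-yourself Python route, a NEAR search does not need a regex at all. A minimal sketch, assuming plain text input and simple \w+ tokenisation (the name near_hits and the return format are made up for illustration): split the text into words, record where each name occurs, and keep the pairs whose positions are close enough.

```python
import re


def near_hits(text: str, word_a: str, word_b: str, max_gap: int) -> list:
    """Return (index_a, index_b) token positions where word_a and word_b
    occur with at most max_gap other words between them, in either order.

    near_hits is a hypothetical helper; tokenisation is a plain \\w+ split.
    """
    tokens = [m.group(0).lower() for m in re.finditer(r"\w+", text)]
    a, b = word_a.lower(), word_b.lower()
    positions_a = [i for i, tok in enumerate(tokens) if tok == a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == b]
    # abs(i - j) - 1 is the number of words strictly between the two hits.
    return [
        (i, j)
        for i in positions_a
        for j in positions_b
        if i != j and abs(i - j) - 1 <= max_gap
    ]


if __name__ == "__main__":
    sample = "Alice looked up. Bob nodded, and Alice laughed."
    print(near_hits(sample, "Alice", "Bob", 2))
```

Unlike a regex match, this reports every qualifying pair, including overlapping ones, which is handy when counting how often two characters are mentioned together.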

jcarman · May 3, 2023, 5:53pm

That is highly useful!