Is there any way to look for duplicate text?

BlueMargo · November 21, 2021, 10:20pm

I had to import many notes to my project. Now I find that there are some paragraphs that are duplicated throughout many files. Is there any quick way to search for duplicates without typing into search bar or reading those texts?
(Scrivener 3, Windows 10)

JimRac · November 21, 2021, 11:05pm

Sorry BlueMargo, Scrivener’s got nothing like this, at least not that I’m aware of.

I’m also not aware of any other tools that do this–are you?

If you are, you could always compile your notes, import them into that other tool, identify the duplicates, then go back to Scrivener and delete them.

Best,
Jim

Mad_Girl_Disease · November 21, 2021, 11:34pm

How were these brought into Scrivener? As lots of separate documents, some of whose text duplicates the text that existed before the import? Do you know where the dupes are likely to occur? How much of this is there?

Are you able to distinguish between the “originals” and the imports? Do you have a pre-import backup?

Depending on the answers to those questions and the last two especially: You might try importing again, and immediately tag those in some way (kw, metadata) so that the docs that introduced the dupes can be identified, which will make it easier to keep track of where you last saw given text, or which instance you’re looking at. If you can still identify which docs are the imports, then consider tagging them. It could be useful.

Word has document compare, I think. There are no doubt online document compare services. If doc comparison would help.

drmajorbob · November 22, 2021, 5:10am

Look for “finding duplications” in §11.1 of the manual. One word in the search box is enough to start.

kewms · November 22, 2021, 5:21am

What do you have against the search bar?

AmberV · November 22, 2021, 2:02pm

Just to note: the features described in that section of the Mac user manual have not yet been added to the Windows version.

BlueMargo · November 22, 2021, 2:32pm

Search bar is great.
It is just that I know there are some duplicates - in the text not duplicate files - but I don’t know what they are yet.
These imports are some of my very old notes from an abandoned thesis that I am trying to revive, so I don’t remember the text well enough to put it in a search bar. Eventually I will be reading the whole thing and flagging possible repetitions - then the search bar will be useful.
I was just wondering if there is an automatic solution to take some load off me.

BlueMargo · November 22, 2021, 2:40pm

I will look into all proposed solutions.
As I mentioned in another answer - the problem is not with duplicated files but in some parts of text.
I have imported some .doc files, as well as text copied and pasted from an old app I used to keep notes in before anyone had heard of Scrivener or Evernote.
Those are my ancient notes and several attempts at writing a thesis, so I believe I copied and pasted a lot of paragraphs and got lost between versions. I was hoping for some automatic solution that would highlight those repetitions and make them easier to handle.

kewms · November 22, 2021, 5:18pm

Do you still have the source notes? Any tool is going to struggle to find duplicates if you don’t know what text to search for.

(How long does a string have to be to count as a duplicate? A few words? A paragraph? A page?)
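To make that question concrete outside Scrivener, here is a rough Python sketch that treats a duplicate as a whole paragraph of at least `min_chars` characters appearing in more than one document. It assumes the notes have been compiled or exported to plain text and loaded into a dictionary; the function name, the file names and the 40-character threshold are all made up for illustration, and this is not how Scrivener itself works.

```python
from collections import defaultdict

def find_duplicate_paragraphs(docs, min_chars=40):
    """Map each repeated paragraph to the sorted names of the documents containing it.

    docs: dict of {document name: plain text}. Paragraphs are split on blank lines.
    Only paragraphs shared by two or more different documents are reported.
    """
    seen = defaultdict(set)
    for name, text in docs.items():
        for para in text.split("\n\n"):
            # Collapse whitespace and case so trivial formatting differences don't hide a match.
            key = " ".join(para.split()).lower()
            if len(key) >= min_chars:
                seen[key].add(name)
    return {para: sorted(names) for para, names in seen.items() if len(names) > 1}

# Hypothetical sample data standing in for exported notes.
docs = {
    "chapter1.txt": "Intro paragraph that is unique to chapter one.\n\n"
                    "This paragraph about method appears in more than one file.",
    "old_notes.txt": "This paragraph about  method appears in more than one file.\n\nShort.",
}

for para, names in find_duplicate_paragraphs(docs).items():
    print(names, "->", para)
```

Raising or lowering `min_chars` is one way to answer "how long counts as a duplicate"; it only catches paragraphs that match exactly after whitespace and case are normalised, so lightly edited copies will slip through.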

BlueMargo · November 22, 2021, 7:56pm

Unfortunately I only have some of the source notes (those in .doc files).
I think the “Exact Phrase” search option will work fine for now. Is there any limit on the phrase length?

If I may express a wish at this point, it would be excellent to be able to highlight text and select ‘search for this’.

Thank you for all your remarks.

michaelhendrsn · November 22, 2021, 8:36pm

This may help:
highlight
Ctrl C (to copy)
Ctrl Shift F (to open project search)
Ctrl A (to highlight previous entry)
Ctrl V (to paste)
Enter (to search the project)

Autohotkey or similar may make this even easier.

AmberV · November 22, 2021, 9:10pm

The planned feature we have will be generally superior in every way, as it will examine the full text of the current outline chunk you’re working with and then scan for sentence-or-greater duplications of that text throughout the entire project, and generate a list of all documents containing those ranges of text, sorted by incident weight and highlighting them for you in both documents. It was a feature added a couple of years after v3 was launched though, so it was put on the post-launch list.

AntoniDol · November 22, 2021, 10:58pm

(\b\w{n,}\b)\W+(?:\w+\W+){1,m}?\1

find the same words of n characters or more within m characters using RegEx.

drmajorbob · November 23, 2021, 6:14am

… within a single document, not across documents, I suspect.

JuanVazquez · April 15, 2024, 6:17pm

Does Scrivener accept RegEx, or does it have to be run on the file itself?

AntoniDol · April 15, 2024, 7:36pm

The normal Search and Project Search features have a search scope for RegEx search and replace.

Julian_M1 · April 16, 2024, 10:41am

I was just trying that - isn’t m the gap in words, not characters? (i.e. that regex finds duplications with <=m words between instances)

That said: words is a much better counter and that’s very neat. Thanks!

AntoniDol · April 16, 2024, 1:17pm

(\b\w{3,}\b)\W+(?:\w+\W+){1,250}?\1

Works for me on a test string of about 250 (m) words.
The first and last word of a minimum of 3 (n) characters in a match are the same.
N = the minimum length of the string to match.
M = the maximum scope of the search in words.

The Regex101 explanation for this Regular Expression is:

1st Capturing Group (\b\w{3,}\b)
  \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
  \w matches any word character (equivalent to [a-zA-Z0-9_])
    {3,} matches the previous token between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
  \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
\W matches any non-word character (equivalent to [^a-zA-Z0-9_])
  + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:\w+\W+){1,250}?
  {1,250}? matches the previous token between 1 and 250 times, as few times as possible, expanding as needed (lazy)
  \w matches any word character (equivalent to [a-zA-Z0-9_])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
  \W matches any non-word character (equivalent to [^a-zA-Z0-9_])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\1 matches the same text as most recently matched by the 1st capturing group
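
For anyone who wants to try the pattern outside Scrivener, here is a minimal Python sketch. It assumes the text is available as a plain string; the helper names `repeated_word_pattern` and `find_repeats` are invented for this example, and Python's `re` engine may differ in small ways from the one Scrivener uses. As above, n is the minimum word length and m is the maximum number of words between the two occurrences.

```python
import re

def repeated_word_pattern(n, m):
    """Build the pattern: a word of n+ characters that recurs after 1 to m intervening words."""
    # Doubled braces are literal braces inside an f-string; {n} and {m} are substituted.
    return re.compile(rf"(\b\w{{{n},}}\b)\W+(?:\w+\W+){{1,{m}}}?\1")

def find_repeats(text, n=3, m=250):
    """Return (repeated word, full matched span) pairs; matches do not overlap."""
    return [(hit.group(1), hit.group(0))
            for hit in repeated_word_pattern(n, m).finditer(text)]

sample = "The thesis draft repeats the word thesis far too often."

# Four words sit between the two occurrences of "thesis", so m=5 finds it.
print(find_repeats(sample, n=4, m=5))
# [('thesis', 'thesis draft repeats the word thesis')]

# With m=3 the gap is too long and nothing is reported.
print(find_repeats(sample, n=4, m=3))
# []
```

Two things to keep in mind: the pattern requires at least one word between the occurrences, and because `\1` is not followed by `\b` the second occurrence can also be the start of a longer word (for example "the" inside "there"). Appending `\b` after `\1` should tighten that if it matters.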

Julian_M1 · April 16, 2024, 1:20pm

I was only clarifying that m was words not characters! (it said characters in the original post)

Works great for me too!