Is it possible to search for all documents in a project that have more than a certain number of words? In some projects I have a lot of documents that are empty and used for organizational purposes, or might just have a few sentences, and I want to extract those that have longer pieces in them if that’s possible.
There would be two approaches to this. One is very precise and only takes one step to use, but it is a bit technical. The other uses basic front-end tools but requires several steps, some of which you may want to undo afterward if you normally prefer different view settings. I’ll do the easy one first:
- Using the Project Search tool, type in an asterisk and nothing else. This is shorthand for “has content”.
- Click the magnifying glass icon and set the search scope to “Text”, instead of “All”. We should now only be seeing documents with some text. No empty icon pages, and probably few folders.
- Click the “Search Results” header bar at the top of the binder sidebar to load the results into the main editor.
- Change the editor view mode to Outliner.
- Add the Word Count column.
- Click on the Word Count column twice to sort descending, thus pushing all of the short files to the bottom of the list.
Now for the geeky way:
- Set Project Search to “Text” and “RegEx”.
- Paste in the following search pattern:
^\W*(?:\w+\b\W*){5,}
I know it looks like the cat jumped on the keyboard, but that will find all documents in the project with five or more words. The part you can adjust is the number in curly brackets on the end. That’s a quantifier for the bit before it, which is just a complicated way of looking for something that resembles a word. The format is minimum, comma, maximum. In this case we leave the maximum blank because we don’t care if a file has 100 or 5,000 words. If you want files with 15 words or more, then {15,}. You can also fill in a maximum, such as {5,250}, but be aware that nothing in the pattern anchors it to the end of the text, so a longer file will still match on its first 250 words; on its own the maximum will not weed out long files.
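For anyone who wants to try the quantifier outside the application, here is a minimal sketch in Python. It assumes Python's `re` module, which is not necessarily the same regex engine the Project Search uses, and the helper names `word_count_pattern` and `matches` are made up for illustration. The last call shows that a maximum only limits how much text the match covers, since the pattern is not anchored at the end.

```python
import re

def word_count_pattern(minimum, maximum=None):
    """Build the pattern from the thread with a {min,max} quantifier (max optional)."""
    upper = "" if maximum is None else str(maximum)
    return r"^\W*(?:\w+\b\W*){" + f"{minimum},{upper}" + "}"

def matches(text, minimum, maximum=None):
    # Without the MULTILINE flag, ^ only matches at the very start of the text.
    return re.search(word_count_pattern(minimum, maximum), text) is not None

short_doc = "Just a few words."
long_doc = "one two three four five six seven eight nine ten"

print(matches(short_doc, 5))     # False: only four words
print(matches(long_doc, 5))      # True: ten words
print(matches(long_doc, 5, 8))   # True: the first eight words satisfy {5,8}
```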
You might want to save that into a search collection so you can easily pull it back up without having to hunt down the regular expression again.
Also note that while using this search result, everything will be highlighted in the files you examine—after all we are searching for documents with words, any words at all, so everything that is a word will be a match. The Typing clears search highlights option in the Editor preference pane may be of some service here, or changing the search highlight colour to something less bold.
Thanks so much! The technical approach works great, but I’m finding some problems with it.
If I put 800 as the number in, I’m pulling up some documents, including one, for example, with 3500 words.
But if I alter the number to 2000 – suddenly all of them disappear. Any idea what’s going on?
It works to find documents with 800 or more words, so in that case the 3,500 word document would qualify. A file with 799 words, on the other hand, would be omitted by the search. As for using large values, I’m not having difficulties, assuming I have documents with over 2,000 words, that is. A search pattern with {2000,} to find only files with 2,000 words or more (like the 3,500 word file) works.
Hrm, well that’s just what I’m having a problem with.
^\W*(?:\w+\b\W*){800,} : I get a bunch of documents, including one with 3500 words.
^\W*(?:\w+\b\W*){2000,} : I get nothing.
And in fact I also just noticed that even the 800+ word search seems to be missing results that should be in it… I just found a different 3300 word document that doesn’t come up in the 800+ word search.
There might be a flaw in it. I found the pattern as a hint for validating web forms, small usages mostly I’d imagine, probably not anticipating 2,000+ words. But in theory it should be fine, and in practice in my project it is working. That leads me to believe there must be some data difference between our projects, or maybe another flag in Project Search that is creating spotty results unintentionally, like the “draft only” option, etc.? Here are my settings:
The pattern basically means:
- Zero or more non-word characters (spaces, tabs, etc.) at the beginning of a line: ^\W*
- Followed by 2,000 or more repetitions of the group (?: … ){2000,}, where each repetition is:
  - One or more word characters (letters and digits): \w+
  - Followed by one word boundary: \b
  - Followed by zero or more non-word characters (again for spaces, ends of lines, hyphens, etc.): \W*
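The same breakdown can be written as a commented pattern. This is only a sketch using Python's `re.VERBOSE` flag, which lets a pattern carry whitespace and comments; that flag is a Python feature and is not something the Project Search field is known to accept. The name `WORDS_2000` is made up for illustration.

```python
import re

WORDS_2000 = re.compile(r"""
    ^\W*          # leading non-word characters (spaces, tabs, punctuation)
    (?:           # non-capturing group: one "word" unit
        \w+       #   one or more word characters
        \b        #   a word boundary
        \W*       #   trailing non-word characters (spaces, line ends, hyphens)
    ){2000,}      # the unit must repeat at least 2,000 times
""", re.VERBOSE)

print(bool(WORDS_2000.search("word " * 2000)))  # True: exactly 2,000 words
print(bool(WORDS_2000.search("word " * 1999)))  # False: one word short
```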
Yup, those are exactly my settings too.
One thing I noticed: earlier you said that every word would be highlighted. But that is definitely not the case. Certain chunks within the documents appear to be highlighted: looks like chunks of 800 words? Would that make a difference?
Okay, going with that clue, there may be some types of characters that do not match the \w or \W wildcards, like word-joiners, that are messing things up. I tried inserting one of those between some characters in the middle of a 1,200 word document and that was sufficient to render it invisible, and when I adjusted the range down to 500 I could see not all of the words were matching.
I don’t know if word-joiners are the thing doing it for you, but that general idea is probably the issue, and it’s something that could be fixed by changing the pattern slightly to: ^\W*(?:[\w]+\b\W*){2000,}, and then pasting these special characters into the [\w] part after the ‘w’. The brackets form a set in which each character is an implied “or”, and we need one or more characters from that set. So the set would mean alphabetic-type characters, or word-joiners, or whatever else you need to throw in between the brackets, i.e. whatever constitutes the visible part of a word.
If it is an invisible one, you should be able to find the character by looking for a spot where the highlight stops, and then moving your cursor, one arrow key press at a time, through the first word after the highlight that doesn’t match. That word contains the character that is tripping things up. If the cursor appears to “trip” over a spot without moving, there is an invisible character there: shift-arrow over that spot to select it, copy it, and paste it into the bracketed area of the search pattern.
In my tests that cleared that specific problem up, and the 1,200 word document with a word-joiner in the middle went back to being matched and fully highlighted.
And if you come across anything else that isn’t technically a “\w” but should be included as part of what constitutes a word, you can paste it in there as well. [\w.’] might be necessary for example, that’s a literal dot and an apostrophe, which may be found in cases like “something.com”, or words like “it’s”.
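To see what adding characters to the bracket changes, here is a rough sketch in Python. One loud caveat: Python's `re` is assumed here only for illustration and does not behave exactly like the search in the application. In Python the plain \w version does not stall on a typographic apostrophe; it just splits the word in two, so the effect shows up as a different word count rather than a missing match. The sample sentence is made up.

```python
import re

sample = "It’s on something.com today"

# One "word unit" per match, using the plain class and the extended class.
plain = re.findall(r"\w+\b\W*", sample)
extended = re.findall(r"[\w.’]+\b\W*", sample)

print(len(plain))     # 6: "It’s" and "something.com" are each split in two
print(len(extended))  # 4: each is kept together as a single word
```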
Moral of the story: it might take a little tuning before it works well. And secondarily, my sample text for testing things lacks common punctuation patterns for colloquial English.
So one thing I discovered is that Word smart quotes are one of these characters.
The other thing is that I tried to highlight some kind of invisible character that seemed like an entire line and paste it in, and then it set the beach ball spinning. It’s been going for many minutes now. I think it’s just taking a really long time to execute the search…? I think I’m going to force quit and reopen though…
Yeah, now that I look at my example in the second to last paragraph I see that the example apostrophe (typographic) is barely noticeable, but that is what I was noting as well. There will be some punctuation you need to add to the bracket for stuff that falls inside of a word.
As for the pesky invisible character, if you don’t know what something is you can paste it into the Mac’s special character browser’s search field and that should give you the code. It might not be something you want in the document to begin with. Whatever it is, if it’s causing search to hang (and no, nothing legitimate should take a really long while to search) it might be problematic further down the line as well in the compiler.
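If you would rather identify a mystery character with a script than with the character browser, something like the following sketch could work. It assumes you can paste the text into a Python string; the function name `describe_suspects` and the sample string are hypothetical. It reports the code point and Unicode name of anything in the control/format categories or the line and paragraph separator categories.

```python
import unicodedata

def describe_suspects(text):
    """Return (code point, name) pairs for invisible or control-type characters."""
    found = []
    for ch in text:
        category = unicodedata.category(ch)
        # "C*" covers control and format characters; Zl/Zp are line/paragraph separators.
        if category.startswith("C") or category in ("Zl", "Zp"):
            found.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return found

# Made-up sample: a word joiner inside a word, then a line separator.
print(describe_suspects("to\u2060day\u2028"))
# [('U+2060', 'WORD JOINER'), ('U+2028', 'LINE SEPARATOR')]
```

Ordinary control characters such as tabs and newlines have no Unicode name, so they show up as `<unnamed>`.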