Feature : medium : Highlights on HTML act counter-intuitively

Let me preface this: I am pretty sure I know why the tool behaves the way it does, and given what I think I know, I doubt there is an easy fix. I figured I would bring it up anyway, just in case:

When you choose HTML -> Text import and the original HTML has text set on a coloured background, the background colour comes along for the ride, along with the original text colour. One thing I would like to do a lot, and I imagine many people would, is highlight and annotate various parts of a web research document. If the text is as described above, the highlight tool removes the existing background colour underneath the selection. On dark background/light foreground sites, this makes the selected text look as though it has been deleted. It is actually still there, in white (or whatever light colour the web designer chose). If I press highlight again, it places the selected highlighter colour behind the text. If it is a sane site design, the result is probably okay at that point; if it is an inverse site with light text, the result can be fairly unreadable.

So, solutions? On one hand, an obvious answer would be a preference to strip colour from the HTML document. But I understand that might be well-nigh impossible without writing your own HTML parser (ha).
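For what it's worth, a limited version of the strip-colour idea can be sketched without a custom parser. The following is only an illustration in Python using the standard library's `html.parser`, not anything the application actually does; the names `strip_colour`, `ColourStripper`, `COLOUR_ATTRS` and `COLOUR_PROPS` are made up for this example. It assumes the colour arrives through inline `style` attributes or the old `color`/`bgcolor` attributes, and it leaves `<style>` blocks and external stylesheets untouched, which is where a real solution would get hard.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: colour only arrives via these attributes and inline CSS properties.
COLOUR_ATTRS = {"color", "bgcolor"}
COLOUR_PROPS = {"color", "background", "background-color"}


def strip_colour_from_style(style):
    """Return an inline style string with colour declarations removed."""
    kept = []
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        if sep and name.strip().lower() not in COLOUR_PROPS:
            kept.append(f"{name.strip()}: {value.strip()}")
    return "; ".join(kept)


class ColourStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus colour attributes and declarations."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if name in COLOUR_ATTRS:
                continue
            if name == "style" and value is not None:
                value = strip_colour_from_style(value)
                if not value:
                    continue  # nothing left once colour is gone
            if value is None:
                parts.append(name)
            else:
                parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_colour(html):
    """Return html with inline colour information removed."""
    parser = ColourStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Running `strip_colour` over a page before handing it to the importer would leave the structure and the non-colour styling alone. Anything coloured by a stylesheet would still come through, so this is at best a partial answer.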

What if it were done post-import: import the HTML using WebKit, then remove all colour information from the resulting document?

Based on its behaviour, I imagine there is no way for the highlight tool itself to detect the situation. Presumably it thinks the text is already highlighted to begin with?

Well, just so you know, anyway.

Upon import:

  1. Select All (Cmd-A)
  2. Show Fonts (Cmd-T)
  3. Set foreground colour to black
  4. Remove Highlight (Cmd-Shift-H)

Given that there is no way to know whether the user wants to keep the formatting or not, I think this one has to stay off the list. I use Apple’s methods for converting HTML to text, and I don’t want to play with them too much.

Incidentally, I am in the process of moving the options to convert HTML to text to the Preferences so that you don’t have to be bothered by a panel every time you import… Pipe up now if you think that this should remain per-import. If so, it means some rewiring so that the import detects whether any HTML or webarchives are being imported. Groan.

Yeah, I’ve been keeping tabs on that thread. I think removing the per-import aspect will not be too much of a hassle for 95% of us, and the constant warning box was a hassle for 100%. But yes, that is why I mentioned a possible “strip format” preference, since I noticed you were going to be migrating the import features into preferences.

Yeah, I kind of figure that if you want to convert HTML or webarchives to text, you are going to want to do it for most files. And of course, you can always import everything as webarchives and then just convert some of them to text after import using File > Convert Web Archive to Text…