Html importer is too clever for its own good?

rms · August 30, 2021, 8:15pm

The reason is explainable. It works as I told you above. Unless you tell the HTML “viewer” programme what the character set is, it is going to assume. And for you, that assumption was not what you expected. Once you told it, it works.
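To illustrate what "it is going to assume" means in practice, here is a minimal Python sketch. It is not what Scrivener or the Mac text engine actually runs; the sample string and the choice of cp1252 as the wrong guess are assumptions made for the demonstration. The same bytes come out garbled or clean depending only on which character set the reader applies.

```python
# Text containing a curly apostrophe, saved to disk as UTF-8 bytes.
original = "It’s fine"
raw = original.encode("utf-8")

# A reader that is not told the charset has to guess. Guessing
# Windows-1252 turns the three UTF-8 bytes of the apostrophe into
# three unrelated characters.
guessed = raw.decode("cp1252")

# A reader that is told the charset recovers the text exactly.
declared = raw.decode("utf-8")

print(guessed)   # Itâ€™s fine
print(declared)  # It’s fine
```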

paulbeard · August 31, 2021, 3:25am

It would be useful if Scrivener’s importer could accept that meta tags can contain multiple values. Perhaps it’s documented that each meta value has to appear on a line by itself. But this has been a common usage for 25 years or more, predating Scrivener.

Why it parses plain text files differently than HTML files, based on nothing but the file name/extension, is unhelpful, but that’s computers in a nutshell. It could be configurable or the assumptions could be documented. Either would suffice. There are some config options for HTML import but nothing that applies to this use case. I wonder if anyone at Scrivener HQ has considered that someone might want to convert a corpus of HTML files to a Scrivener project and how that might be accommodated.

paulbeard · August 31, 2021, 3:27am

And of course the meta tag example doesn’t show; triple ` doesn’t escape that.

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

devinganger · August 31, 2021, 8:33am

It does, if the values are properly constructed.

Taking a closer look at the meta line you posted, it seems like the tool you were using malformed it:

<meta
name="generator"
content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org; charset=utf8"
>

Looking at the quotes, there should be a couple of extra ones, like this:

<meta
name="generator"
content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org;"
charset="utf8"
>
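To see why the placement of the quotes matters, here is a rough sketch using Python's standard `html.parser`. It only shows how a generic HTML parser splits the two variants into attributes; it makes no claim about how Apple's converter or Scrivener handles them. `MetaCollector` and `meta_attributes` are hypothetical helper names invented for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the attributes of every <meta> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metas.append(dict(attrs))


def meta_attributes(html):
    """Return one attribute dict per <meta> tag found in the markup."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.metas


first = '''<meta
name="generator"
content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org; charset=utf8"
>'''

second = '''<meta
name="generator"
content="HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org;"
charset="utf8"
>'''

# In the first variant "charset=utf8" is just text inside the
# generator description; there is no charset attribute at all.
print(meta_attributes(first)[0].get("charset"))   # None

# In the second variant charset is an attribute of its own.
print(meta_attributes(second)[0].get("charset"))  # utf8
```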

AmberV · August 31, 2021, 2:28pm

Just to make sure we’re on the same page, you seem to be conflating Scrivener with the Mac text engine converter and/or the WebKit page viewer (the latter if you’re importing as .webarchive and displaying the rendered read-only copy). If you have a beef with how either of those works, then file a report with Apple. We have nothing to do with how that works, and Scrivener itself is all but ignorant of HTML.

That said, as devinganger notes, the issue you’re seeing has to do with a syntax error, not a lack of support for bog standard HTML header usage. The second example you posted works fine for me, provided I match the charset with the actual document encoding, naturally.

Why it parses plain text files differently than HTML files, based on nothing but the file name/extension, is unhelpful, but that’s computers in a nutshell. It could be configurable or the assumptions could be documented.

Would you like them to be imported as plain text? If so, there is a setting for that in the Sharing: Import preference pane. As I’m sure you can appreciate, most people are going to want HTML files to convert to either rich text or archived pages, so .html isn’t in the Plain text import formats file extension list.

There are some config options for HTML import but nothing that applies to this use case. I wonder if anyone at Scrivener HQ has considered that someone might want to convert a corpus of HTML files to a Scrivener project and how that might be accommodated.

Indeed we have, under the proviso that one is importing a corpus of valid and well-formed HTML files. As you can see, we provide three fundamentally different options that should cover almost every use case:

Import as read-only web page for reference.
Import and convert to RTF as an editable text source.
Import as raw HTML syntax, for those that work in it directly.

However, in all but the third case, you should not expect a good result from a malformed HTML file, or from TXT files that have ‘.html’ extensions but don’t contain a valid HTML structure. Get your files in good form, and you’ll find you can import all ~80 of them to whatever result you prefer, depending upon your settings.
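For anyone wanting to vet a batch of files before importing, here is one possible pre-flight check, sketched in Python. It is an assumption-laden illustration and not part of Scrivener: `CharsetFinder` and `check_html` are hypothetical names, and the check only looks for a declared charset and tests whether the file's bytes decode under it.

```python
import re
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remember the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Form: <meta charset="...">
            self.charset = attrs["charset"].strip()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Form: <meta http-equiv="Content-Type" content="...; charset=...">
            match = re.search(r"charset\s*=\s*([\w.:-]+)",
                              attrs.get("content") or "", re.I)
            if match:
                self.charset = match.group(1)


def check_html(raw):
    """Return (declared_charset, problem); problem is None if all looks fine."""
    finder = CharsetFinder()
    # latin-1 maps every byte to a character, so the ASCII markup
    # can be scanned before the real encoding is known.
    finder.feed(raw.decode("latin-1"))
    finder.close()
    if finder.charset is None:
        return None, "no charset declared"
    try:
        raw.decode(finder.charset)
    except LookupError:
        return finder.charset, "unknown charset name"
    except UnicodeDecodeError:
        return finder.charset, "bytes do not match declared charset"
    return finder.charset, None


good = ('<meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1252"><p>It’s</p>').encode("cp1252")
bad = '<meta charset="utf-8"><p>It’s</p>'.encode("cp1252")

print(check_html(good))  # ('windows-1252', None)
print(check_html(bad))   # ('utf-8', 'bytes do not match declared charset')
```

To run it over a folder, one could pass each file's `pathlib.Path.read_bytes()` result to `check_html`. Note that a clean decode is only weak evidence: single-byte encodings such as Windows-1252 accept almost any byte sequence, so this catches a wrong UTF-8 declaration far more reliably than the reverse.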