HTML importer is too clever for its own good?

paulbeard · August 30, 2021, 1:35am

I just imported a lot of HTML files and found that the importer turned all my curly quotes and em dashes into gibberish. They are fine in the source documents; the munging and muddling happens at import. Can we turn that off? Or, failing that, is there a more global search and replace to make this easier to fix? Repeating the same steps across 84 documents is so tedious. Importing as txt means I lose all the HTML links and other useful stuff. Not all of it comes across, but enough that I can cope. This other business is too much.

paulbeard · August 30, 2021, 2:46am

Or, since I can’t edit the post, what formats does the parser/importer expect? I can massage the files before importing using tidy or perhaps other tools. But I am shooting in the dark here.

kewms · August 30, 2021, 4:23am

Are you importing as a WebArchive or as text? Relevant options are in the Scrivener → Preferences → Sharing → Import tab. (Note that the “text” import is rich text, not plain text. That is, it will preserve links.)

Can you post a sample link?

paulbeard · August 30, 2021, 4:30am

I’m importing text files that I saved as HTML from writing I did elsewhere. Links are preserved but I still get this messiness:

who live and work there need â€” for simple economic reasons â€” to

That gibberish should be a neat pair of em dashes. This is from the text file before importing:

who live and work there need — for simple economic reasons — to

So how do I import that without issue?

rms · August 30, 2021, 7:18am

It would help with debugging if you posted the raw HTML source here, including the header and any CSS that might be involved.

AmberV · August 30, 2021, 1:13pm

That looks like a file encoding issue to me. Scrivener, like all native macOS software, will be expecting files to be UTF-8. I’d open these HTML files in a coding editor capable of displaying and modifying the file encoding, like BBEdit. You may even be able to use TextEdit to change the encoding upon using “Save As”.

paulbeard · August 30, 2021, 1:58pm

No headers or CSS…they are just txt files renamed as html with the links and other styling. No HEAD or other tags, no CSS. I could (and have) run them through htmltidy to add the other elements that the parser might be expecting but it has made no difference. The files are all UTF-8.

AmberV · August 30, 2021, 2:18pm

Oh, that might be the issue actually. If the documents aren’t valid HTML files, then the internal conversion of HTML to text could potentially be falling back to the assumption that the document is ASCII, even though it isn’t, since it lacks a doctype and encoding indicator at the top of the file.
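The gibberish shown earlier in the thread fits this explanation. Here is a minimal sketch in Python of how it arises, assuming the fallback is a single-byte Western encoding such as Windows-1252. That choice is an assumption for illustration; nothing in the thread confirms which encoding Scrivener's converter actually falls back to.

```python
# Assumption: the importer decodes UTF-8 bytes with a single-byte Western
# encoding (Windows-1252 here). The thread does not confirm which one.
EM_DASH = "\u2014"

utf8_bytes = EM_DASH.encode("utf-8")    # one character stored as three bytes
misread = utf8_bytes.decode("cp1252")   # each byte is read as its own character

print(utf8_bytes.hex())
print(len(misread), repr(misread))

# The damage is reversible as long as no bytes were lost along the way.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == EM_DASH)
```

The first two misread characters are the "â€" pair seen in the imported text, which is why every em dash and curly quote turns into a short run starting with the same two symbols.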

paulbeard · August 30, 2021, 2:23pm

I have passed them thru htmltidy to add all that chrome/apparatus and it still doesn’t work. If anyone knows what types of HTML it expects so it doesn’t rewrite the file, that would help.

I can’t upload an attachment. And I can’t include links in posts.

AmberV · August 30, 2021, 2:45pm

You have to establish a bit of a reputation with the forum software to access those functions. It doesn’t take much, just a little searching, reading and a post or two. Enough to be difficult for a spam bot to replicate.

You shouldn’t need an attachment to share the salient portion of an HTML sample though, just put three backticks (`) on a line, paste the HTML on the next line, and then another three backticks on the line following (that indicates a code block). We’d really only need to see the part up to the end of the HEAD.

For example, here is what we’d like to see in an HTML5 file:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
	<meta charset="utf-8"/>
	<title>Test</title>
</head>

If htmltidy assigned a different charset, you may still be in the same predicament as not having a charset line at all.

paulbeard · August 30, 2021, 3:21pm

<html>
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org">
<title></title>
</head>

Will that work?

rms · August 30, 2021, 3:31pm

Inside the head, try adding the meta charset line from the example @AmberV gave above.

paulbeard · August 30, 2021, 4:09pm

<html>
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org; charset=utf8">
<title></title>
</head>

Still no joy.

rms · August 30, 2021, 4:15pm

the charset line is what i had in mind.

also a copy paste of the offending character.

paulbeard · August 30, 2021, 4:41pm

see link number 4 in the chain above…I included the characters there.

rms · August 30, 2021, 5:05pm

I am trying to help by pointing at what you should look at. What you posted above is how Scrivener displays the HTML, which may not be what is in the HTML source. How text is displayed depends on the bytes stored for each character and the character set used to interpret them. Without being told a character set, anything that displays HTML (Scrivener, Safari, anything) will guess or assume.

paulbeard · August 30, 2021, 6:01pm

I have provided the source and the output, the meta data I was asked to add. If you don’t see the charset in the most recent example, it’s there. The output was exactly the same. â€” was output instead of — just as before.

What would help is knowing what Scrivener’s importer expects. I’m not sure I can provide that but it would be a good place to start. I don’t know if it prefers numeric HTML entity codes.

I’m not sure what else I can do. This feature doesn’t do what I want or expect, so it’s time to figure out another way.

xiamenese · August 30, 2021, 6:43pm

In your “still no joy” post you put the UTF8 charset inside the content of the name="generator" meta tag. Have you tried putting it in as a separate line, as in AmberV’s post? Does that make any difference?

Mark

paulbeard · August 30, 2021, 7:30pm

Well, that works. Inexplicably but what can anyone do about that? I wonder why the parser doesn’t evaluate multiple values in the meta section but that’s not for me to solve. Now to work out some way to add that programmatically to 84 files. tidy isn’t up to it, I see.
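A likely reason, sketched with Python's standard `html.parser` as a stand-in (this is not Scrivener's parser, and the helper names `MetaCollector` and `declared_charset` are made up for the illustration): a parser sees `charset` only when it is an attribute of its own meta tag. Text placed inside the `content` string of the generator tag is just part of that string.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Record the attributes of every <meta> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> are routed here as well.
        if tag == "meta":
            self.metas.append(dict(attrs))


def declared_charset(html):
    """Return the value of the first <meta charset=...> attribute, or None."""
    collector = MetaCollector()
    collector.feed(html)
    for attrs in collector.metas:
        if "charset" in attrs:
            return attrs["charset"]
    return None


# Shortened versions of the two variants tried in this thread.
buried = '<head><meta name="generator" content="HTML Tidy; charset=utf8"><title></title></head>'
standalone = '<head><meta charset="utf-8"/><title></title></head>'

print(declared_charset(buried))
print(declared_charset(standalone))
```

In the first variant the only attributes found are `name` and `content`, so there is no charset declaration for an importer to act on.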

AmberV · August 30, 2021, 7:50pm

Depending on what version of tidy you are using, there should be an option for that (tidy has lots of options). Here is how it would work on my system:

tidy --add-meta-charset yes <filename.html>
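If your copy of tidy does not have that option (older builds may not), a short script can do the same job across all 84 files. This is a minimal sketch in Python, assuming each file already has an opening `<head>` tag like the tidy output posted earlier. The folder name `html_exports` and the helper name `add_meta_charset` are hypothetical, so adjust them to suit, and try it on copies of the files first.

```python
import re
from pathlib import Path

META = '<meta charset="utf-8"/>'
# Matches <head> or <head ...>, but not <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
# Matches only a meta tag whose first attribute is charset.
HAS_CHARSET = re.compile(r"<meta\s+charset\s*=", re.IGNORECASE)


def add_meta_charset(html: str) -> str:
    """Insert a standalone charset meta tag right after the opening <head>."""
    if HAS_CHARSET.search(html):
        return html  # already declared as its own tag; leave untouched
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <head> to anchor on; run the file through tidy first
    return html[: match.end()] + "\n" + META + html[match.end():]


if __name__ == "__main__":
    folder = Path("html_exports")  # hypothetical folder holding the files
    if folder.is_dir():
        for path in sorted(folder.glob("*.html")):
            original = path.read_text(encoding="utf-8")
            updated = add_meta_charset(original)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
```

Files that already carry a standalone charset tag are skipped, so running the script twice does no harm.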