I just imported a lot of HTML files and found that the importer turned all my curly quotes and em dashes into gibberish. They are fine in the source documents; the munging and muddling happens at import. Can we turn that off? Or, failing that, is there a more global search and replace to make this easier to fix? Repeating the same steps across 84 documents is tedious. Importing as txt means I lose all the HTML links and other useful stuff. Not all of it comes across, but enough that I can cope. This other business is too much.
Or, since I can't edit the post, what formats does the parser/importer expect? I can massage the files before importing using tidy or perhaps other tools. But I am shooting in the dark here.
Are you importing as a WebArchive or as text? Relevant options are in the Scrivener > Preferences > Sharing > Import tab. (Note that the "text" import is rich text, not plain text. That is, it will preserve links.)
Can you post a sample link?
I am importing HTML files that I saved from writing I did elsewhere. Links are preserved but I still get this messiness:
who live and work there need ââŹâ for simple economic reasons ââŹâ to
That gibberish should be a neat pair of em dashes. This is from the text file before importing:
who live and work there need — for simple economic reasons — to
So how do I import that without issue?
It would help with debugging if you posted the source/raw HTML here, including the header and any CSS that might be involved.
That looks like a file encoding issue to me. Scrivener, like all native macOS software, will be expecting files to be UTF-8. I'd open these HTML files in a coding editor capable of displaying and modifying the file encoding, like BBEdit. You may even be able to use TextEdit to change the encoding upon using "Save As".
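If an editor is not handy, a quick way to confirm whether the files really are UTF-8 is to try decoding them strictly. This is only a minimal sketch in Python; the function names and the folder argument are made up for illustration and have nothing to do with Scrivener itself.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding: any invalid byte raises
    except UnicodeDecodeError:
        return False
    return True


def report_encoding(folder: str) -> dict:
    """Map each .html file name in `folder` (hypothetical path) to a UTF-8 verdict."""
    return {
        path.name: is_valid_utf8(path.read_bytes())
        for path in sorted(Path(folder).glob("*.html"))
    }
```

Any file that maps to `False` was saved in some other encoding and would need converting before import. A `True` only says the bytes are valid UTF-8, not that the importer will read them that way.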
No headers or CSS... they are just txt files renamed as html with the links and other styling. No HEAD or other tags, no CSS. I could (and have) run them through htmltidy to add the other elements that the parser might be expecting but it has made no difference. The files are all UTF-8.
Oh, that might be the issue actually. If the documents aren't valid HTML files, then the internal conversion of HTML to text could potentially be falling back to the assumption that the document is ASCII, even though it isn't, since it lacks a doctype and encoding indicator at the top of the file.
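To illustrate the kind of fallback described here: an em dash is three bytes in UTF-8, and a reader that treats each byte as its own character shows three junk characters in its place. The sketch below uses Windows-1252 as the wrong encoding purely as an assumption for demonstration; which encoding Scrivener's importer actually falls back to is not known from this thread.

```python
def misdecode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


em_dash = "\u2014"
print(em_dash.encode("utf-8"))   # b'\xe2\x80\x94'
print(misdecode(em_dash))        # three junk characters instead of one dash
print(len(misdecode(em_dash)))   # 3
```

The same thing happens to curly quotes, which also take three bytes each in UTF-8.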
I have passed them through htmltidy to add all that chrome/apparatus and it still doesn't work. If anyone knows what types of HTML it expects so it doesn't rewrite the file, that would help.
I can't upload an attachment. And I can't include links in posts.
You have to establish a bit of a reputation with the forum software to access those functions. It doesn't take much, just a little searching, reading and a post or two. Enough to be difficult for a spam bot to replicate.
You shouldn't need an attachment to share the salient portion of an HTML sample though, just put three backticks (`) on a line, paste the HTML on the next line, and then another three backticks on the line following (that indicates a code block). We'd really only need to see the part up to the end of the HEAD.
For example, here is what we'd like to see in an HTML5 file:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta charset="utf-8"/>
<title>Test</title>
</head>
If htmltidy assigned a different charset, you may still be in the same predicament as not having a charset line at all.
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org">
<title></title>
</head>
Will that work?
inside the header, try adding the meta field as in the example given by @AmberV above.
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 1045), see www.w3.org; charset=utf8">
<title></title>
</head>
Still no joy.
the charset line is what i had in mind.
also a copy paste of the offending character.
see link number 4 in the chain above... I included the characters there.
i am trying to help you by pointing at what you should look at. above is how you report Scrivener displays the html. that may not be what is in the html source. how things are interpreted for display depends on the ascii code of the character and the character set. without being told a character set, anything that displays html (Scrivener, Safari, anything) will guess or assume.
I have provided the source, the output, and the metadata I was asked to add. If you don't see the charset in the most recent example, it's there. The output was exactly the same. ââŹâ was output instead of — just as before.
What would help is knowing what Scrivener's importer expects. I'm not sure I can provide that but it would be a good place to start. I don't know if it prefers numeric HTML entity codes.
I'm not sure what else I can do. This feature doesn't do what I want or expect, so it's time to figure out another way.
In your "still no joy" post you included the UTF8 in the content of the name="generator"... field. Have you tried putting it in as a separate line as in AmberV's post? Does that make any difference?
Mark
Well, that works. Inexplicably, but what can anyone do about that? I wonder why the parser doesn't evaluate multiple values in the meta section but that's not for me to solve. Now to work out some way to add that programmatically to 84 files. tidy isn't up to it, I see.
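For what it's worth, this matches how HTML itself defines an encoding declaration: a charset only counts when it appears as a `charset` attribute on a meta tag, or in the `content` of a meta tag whose `http-equiv` is `Content-Type`. Text inside a `name="generator"` tag is just a description. Here is a rough Python sketch of that rule; it is an illustration only, not Scrivener's actual importer, and the names `declared_charset` and `_CharsetFinder` are invented for the example.

```python
import re
from html.parser import HTMLParser


class _CharsetFinder(HTMLParser):
    """Records the first encoding declared by a recognized <meta> form."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if "charset" in attributes:
            # <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower() or None
        elif attributes.get("http-equiv", "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            found = re.search(
                r"charset\s*=\s*[\"']?([\w.:-]+)",
                attributes.get("content", ""),
                re.IGNORECASE,
            )
            if found:
                self.charset = found.group(1).lower()


def declared_charset(html_text: str):
    """Return the declared charset, or None if no recognized declaration exists."""
    finder = _CharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

Fed the head with the separate `<meta charset="utf-8"/>` line, this returns `"utf-8"`; fed the version where the charset was appended to the generator tag's content, it returns `None`.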
Depending on what version of tidy you are using, there should be an option for that (tidy has lots of options). Here is how it would work on my system:
tidy --add-meta-charset yes <filename.html>
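If the installed tidy is too old to have that option, the same line can be added to all 84 files with a short script. This is a sketch under assumptions: the files are already valid UTF-8, each one has a `<head>` tag (as the tidied ones do), and the folder path passed in is whatever holds your copies. The function names are invented for the example. Run it on a backup first, since it rewrites files in place.

```python
import re
from pathlib import Path

META_LINE = '<meta charset="utf-8"/>'
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
# Deliberately narrow: only recognizes the <meta charset=...> form,
# so a charset buried in another tag's content does not count.
_HAS_CHARSET = re.compile(r"<meta\s+charset\s*=", re.IGNORECASE)


def add_meta_charset(html_text: str) -> str:
    """Insert the charset line right after <head>, if it is not already there."""
    if _HAS_CHARSET.search(html_text):
        return html_text
    head = _HEAD_OPEN.search(html_text)
    if head is None:
        return html_text  # no <head> to attach to; leave the file alone
    end = head.end()
    return html_text[:end] + "\n" + META_LINE + html_text[end:]


def process_folder(folder: str, pattern: str = "*.html") -> list:
    """Rewrite matching files in place and return the names that changed."""
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        updated = add_meta_charset(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

Running `process_folder` a second time changes nothing, because files that already carry the line are skipped.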