What is the encoding of content.rtf, and why a UTF-8 decoding error on 0xa0?

I am reading the content.rtf files mostly successfully, but a couple throw an error like this:

'utf-8' codec can't decode byte 0xa0 in position 1: invalid start byte

RTF Conversion failed on line/para:

\par\pard\plain \fi360\sa200\qc\ltrch\loch {\f0\fs20\b0\i0\ul\ulc0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}

UPDATE re the quote above: in the actual file there are backslashes between the ~ characters; this is the line separator in Scrivener. This particular error I can make go away by… removing the line separator. But there are others, such as

\pard\plain \fi360\sa200\ltrch\loch {\f0\fs20\b0\i0 Nathan

but inspecting the hex shows no 0xa0 at the claimed position. It looks like it could be a striprtf issue, but I’m eager for insight from anyone with any ideas.

Original text continues…

I have opened the file in Notepad and used Save As to check the encoding, and it indicates that it is indeed UTF-8.

Any ideas from anyone about what is going on?

FYI, I am reading the file with Python and explicitly decoding as UTF-8; I am performing paragraph-by-paragraph RTF-to-text stripping with rtf_to_text from the striprtf package (imported via from striprtf.striprtf import rtf_to_text).
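In case it matters, here is a stripped-down sketch of what the loop looks like (simplified, with placeholder names and an approximate paragraph split; not my exact code):

```python
# Stripped-down sketch of what I'm doing (file name and paragraph
# split are placeholders, not my exact code).
from striprtf.striprtf import rtf_to_text

with open("content.rtf", "rb") as f:
    text = f.read().decode("utf-8")   # errors="strict" by default

for para in text.split(r"\par"):      # rough paragraph-by-paragraph split
    try:
        plain = rtf_to_text(para)
    except UnicodeDecodeError as err:
        print("RTF Conversion failed on line/para:", para)
        print(err)
```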

If I attempt to strip the RTF of the whole file at once I get even more UTF-8 conversion issues, but it is now down to fewer than 10 out of 200 files, and the issue is always this one “invalid start byte” of 0xa0. Maybe someone knows something that could help me resolve this.

PS This is in aid of proper multiple whole-word searching, which now works fine (even for hyphenated words!) except for these odd paragraphs that get skipped because of the UTF-8 error.
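(The search itself is along these lines; a hypothetical sketch rather than my exact code:)

```python
import re

# Hypothetical whole-word match: \b word boundaries, with re.escape
# so hyphenated terms are taken literally.
def whole_word_found(word: str, text: str) -> bool:
    return re.search(rf"\b{re.escape(word)}\b", text) is not None

whole_word_found("no-break", "a no-break space")  # True
whole_word_found("break", "line-break here")      # True: '-' is a word boundary
```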

Not really a geek, but 00A0 is the no-break space; you don’t, by any chance, have any paragraphs in the offending documents that have a spurious no-break space as their first character, do you?

Just a wild guess.

😄

Mark


That’s stretching the “not really a geek” a bit, isn’t it, spotting that 0xa0 is &nbsp;/char 160? 🙂

No, in fact I can’t find the offending characters at all really. I just updated the question to that effect…
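For what it’s worth, this is the sort of scan I mean (a quick sketch, not my exact code):

```python
# Quick scan for the byte: raw 0xa0 bytes and RTF hex escapes (\'a0)
# in the undecoded file.
data = open("content.rtf", "rb").read()

print("raw 0xa0 bytes at:", [i for i, b in enumerate(data) if b == 0xA0])
print("\\'a0 escapes at:", [i for i in range(len(data) - 3)
                            if data[i:i + 4] == b"\\'a0"])
```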

Still perplexed.

Don’t wish anyone insufficiently geeky to waste time on this 😉

Resolved-ish

Changed the Unicode decoding errors argument from “strict” to “backslashreplace”, per the documentation:

The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), 'ignore' (just leave the character out of the Unicode result), or 'backslashreplace' (inserts a \xNN escape sequence). The following examples show the differences:
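For the exact byte from my error, for instance:

```python
raw = b"\xa0Nathan"

# raw.decode("utf-8")                           # raises UnicodeDecodeError
raw.decode("utf-8", errors="replace")           # '\ufffdNathan' (U+FFFD)
raw.decode("utf-8", errors="ignore")            # 'Nathan'
raw.decode("utf-8", errors="backslashreplace")  # '\\xa0Nathan'
```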

With luck this has made the text searchable, albeit with a tiny piece of junk in it. Acceptable.

I might go back to 'ignore' now that I know it just leaves out the offending character. RTFM, OP!

Glad you’re finding a solution.

I’m familiar with 00A0/&nbsp; because so many DOCX texts sent to me by collaborators in China are full of unnecessary &nbsp;s, as are texts downloaded from websites that have been edited in situ.

Someone here was having trouble with them causing poor line-wrapping only a week or so ago.

😄

Mark


Well played 🙂. And now we have a little clue about your handle!

That’s been hiding in plain sight for well over a decade! Yes, I lived in Xiamen (pronounced ‘hsiamen’; or roughly ‘ya-mng’ in Hokkienese, the local language, and known in the past as “Amoy”) for 14 years. So I chose “xiamenese” as my handle on most forums like this one, which I joined in January 2007!

You’ll find posts about my collaborating with Chinese colleagues going back much of that way.
😄

Mark

That’s why I didn’t get it, as Amoy is the name I knew. A handsome city with apparently much to recommend it, especially for being built on lowlands. Somehow it makes me think of Liverpool, and maybe for good reasons of history.

My father grew up in Nanking, as it was known then, and I had a life in Korea for a time in my 20s, latterly teaching a bridge from science to technology to some very adept grad students indeed. That well filled a number of hours, but there were somehow also many more. An intensity of life, in those times…