Chinese dictionary / spell check

MilkTeaCat · March 23, 2023, 8:30am

I tried using Scrivener for Chinese writing a few days ago, but found that everything I typed was flagged with a squiggly underline as misspelled. I realized that it was because I didn’t have the Chinese dictionary installed, so I went to download it, only to find that L&L doesn’t offer one to download. After doing some research, I found Scrivener uses Hunspell for its dictionaries. I went to download a Chinese one, as the GitHub page says that it fully supports Simplified Chinese. However, I couldn’t find a dictionary for download anywhere. I checked sites like LibreOffice, OpenOffice, Firefox, etc., but none of the sites have a Chinese dictionary. I’m starting to suspect that Hunspell doesn’t actually have a Chinese dictionary.
Is there a way/workaround I can use to write in Chinese in Scrivener?

narrsd · March 24, 2023, 5:24pm

I suspected you might be out of luck on this, and a little searching suggests the difficulty is real. Chinese isn’t susceptible to the kind of word-list approach that works for spell checking many other languages. You can read about it in comments here.

I would just turn spell checking off when writing in Chinese, so you don’t get the distracting red underlines. I’m not on a Windows machine at the moment, but you might be able to just turn off spellcheck-as-you-type.

MilkTeaCat · March 25, 2023, 4:21am

Thanks for the response!
I’ll turn spell checking off when I choose to use Scrivener for Chinese in the future.
I noticed that Japanese and Korean are available, even though they work differently from western languages and are similar to Chinese, and it feels unfair. I understand that L&L is not responsible for the creation of these dictionaries, and that they would have to change their spell check system just to accommodate this one language. However, it seems to be a big oversight to exclude such a widely used language and such a huge portion of potential users, especially when Scrivener supports setting its interface to Chinese.
It would be nice to have some insight from L&L or people involved with Hunspell, and know whether users hoping for this functionality should hold out hope.
To throw some ideas out there, would it be possible to “borrow” spell check functionality from Windows using the Spell Checking API or via the .acl files?
If all of the above really is impossible, it would be nice to at least have a correct character counter for Chinese. Since there are no spaces between characters, Scrivener counts big blocks of text as a single word and not as multiple characters. Having this at the very least would make writing Chinese in Scrivener possible.
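
To make the counting idea concrete, here is a minimal sketch of a counter that treats each Chinese character as one unit and each run of Latin letters or digits as one word. It is not how Scrivener counts anything; the function name `count_units` and the choice of Unicode ranges (the main CJK Unified Ideographs block plus Extension A) are assumptions made for illustration, and rarer extension blocks are left out.

```python
import re

# Assumption: "Chinese character" means a code point in the main CJK Unified
# Ideographs block (U+4E00-U+9FFF) or Extension A (U+3400-U+4DBF).
HAN_CHAR = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")

# A Latin "word": letters or digits, optionally joined by an apostrophe or hyphen.
LATIN_WORD = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")


def count_units(text: str) -> int:
    """Count each Han character as one unit and each Latin word as one unit."""
    han_chars = len(HAN_CHAR.findall(text))
    latin_words = len(LATIN_WORD.findall(text))
    return han_chars + latin_words


print(count_units("他在文学方面有很高的造诣。"))    # 12: punctuation is not counted
print(count_units("I love 奶茶 and Scrivener"))  # 6: four English words, two characters
```

A whitespace-based counter would report the first sentence as a single word, which is the behaviour described above.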

narrsd · March 25, 2023, 5:15am

Well, you’re a little mistaken to think that Japanese and Korean work in the same way as Chinese. I wasn’t going to mention this, to avoid complication, but I see you are as perseverant as many who come here somehow thinking that their will for something should cause it to appear, or the stars to align differently than they naturally may.

Korean particularly, and Japanese which is quite related, are fundamentally syllabic in their writing, built from definite phonemes. Korean in fact has a perfect phonemic alphabet, and what may seem to be characters are mostly two to four letters combined in a box format. Japanese has two syllabaries of its own, hiragana and katakana, which are written linearly.

Both of these, it’s true, can use Chinese characters, and this was a mark of degree of education, drawing on the thousand years or so of Confucian influence. But in Korean or Japanese, there would be no way to incorporate their grammar without their own symbols dominating, so that the characters are typically specific nouns or parts of adjectives, on their own mostly one or two at a time, and that cluster having a specific meaning.

Chinese by comparison, as you’ll know from your studies, jams all the characters together, a reader recognizing what groupings are appropriate by context and the usual memorization of patterns it relies so heavily on in all ways…

Because there are these overt grammars in Korean and Japanese, what are ‘words’ are mostly very apparent – even if some of them are characters – and there are spaces clearly delineating them, so a spell checker in the sense we are used to is much more possible. A Chinese spell checker would actually need to be able to understand the language at a level we call these days artificial intelligence. While this is certainly possible by now, appearing on your telephone or laptop, it isn’t actually done there, but rather on immense machines which we are now recognizing as a growing source of global warming, especially when their linguistics get as good as the GPT-3/4 interlocutors appearing now.

If you see where this is going, it isn’t going to fit on your personal machine to have Scrivener or its Hunspell decode your Chinese lesson homework…

The suggestion I did think of to pass on is pretty simple, though – feed your Chinese needing editing to translate.google.com, or another favorite – possibly as your ability improves you could use a China-constructed machinery which might do even better. Probably in time, China develops some level of actual spell checker if you like, by combining programming/neural approaches in such a way, maybe ready when you could use its Chinese output adeptly enough. It still will be a service, not on a portable item.

What started my own further reflecting on your problem was remembering my Korean compatriots spelling out the Chinese characters with a finger on the other hand’s palm, to let motor memory act – it’s a warm, very human memory, and insight into how they did it, in their own education. Combined approaches, again – rather than the western style of perfection I was there to assist in getting us walking across certain bridges. This is what we do in complex designing as well, a topic involving those bridges – and I recommend the flexibility to you.

MilkTeaCat · March 25, 2023, 5:41am

I’m sorry for my ignorance. I don’t actually know Korean and was wrong to assume it was similar to Chinese. However, I do know a bit of Japanese and shouldn’t be wrong in saying that there aren’t spaces between characters, which is what I meant in saying that these languages are similar to Chinese. However, as you pointed out, most of Japanese is phonetic, and Kanji are distinct from the phonetic characters, so I realise Hunspell probably “adds” spaces between phrases in order to spell check these languages.
But I wanted to point out that Word does support spell checking for Chinese, however limited it may be, and that implementing a system for spell checking Chinese would benefit a large number of Scrivener users. In the end, it is L&L’s choice whether or not to implement such a system. I acknowledge that it would be a lot of work and would respect L&L’s decision if they refused, but it would never hurt to ask.
In addition, Hunspell’s GitHub says that they do support Simplified Chinese fully, so it could be that there just aren’t any public dictionaries available just yet. In that case it wouldn’t be unreasonable to hold out some hope, right?
Also, even if Chinese spell checking was not destined to be in Scrivener, a correct word count would still be better than not having one.

narrsd · March 25, 2023, 6:48pm

A morning’s energies to go further from last night’s:

Well, fair enough. And to be more fair on my side, part of the tone in last night’s note was due to another person leaning on the small yet quite capable team at Scrivener, for something much less sensible.

I had a look, and yes, MS does have a proofing tool (not exactly a spell checker) for Simplified Chinese, which operates offline. I dug into background, and there wasn’t much, except to say it’s been improving over the years, and some papers from ten or twelve years ago from which you could see they’d moved on from algorithmic means to heavy use of machine learning–and then the kind of multiple collaborative engines I mentioned.

First, let’s look at how well it works. I installed it on my machine, and put in some sample text. I got it to find one gross error, made by repeating a character several times in a row. But it failed entirely on two short test sentences from one of those computational science papers, where you could see Microsoft researchers and top (Tsing Hua) Chinese university scholars working together. Here are the test texts, with context:

Consider the sentence “他在文學方面有很高的造旨。”, in which the character “旨” is a typo. For another sentence “他在文學方面有很高的造藝。”, the character “藝” is also a typo. The correct character of the two typos should be “詣”. Chinese spelling errors often stem from two main reasons: one is similar sound (e.g., *藝 and 詣) and the other is similar shape (e.g., *旨 and 詣), as pointed by Liu et al. (2011).

Now, given how my Pixel phone could visually and instantly translate the Chinese screens of the MS Chinese package installation for me, it’s clear how well Chinese can now be understood. Though it is a different and more difficult problem, I’d venture that layering on recognition of contextually nonsensical phrases is probably doable today at a pretty useful level – creating a proofing solution which, we should be clear, is necessarily much more involved than the idea of spell checking, and which is a necessity in working with Chinese.

To make an aside, ‘spell checking’ would be like what my Korean grad students and friends, as we were much of an age, were doing in their finger-writing of characters – they were remembering all the necessary orders of strokes, ticks, etc. that make a written character. But in typed text, those problems are nowhere near as likely to occur, unless there’s a phonetic similarity, etc…

I need to say also that this MS Chinese add-on is possible because the end result of machine learning is a model that is immensely less demanding on a computer than the training itself, so it can be fit in, as certain things also are on phones, where they don’t need to phone home over the internet for powerful abilities like the visual translation I just mentioned.

Whether this reductive ability extends to proofing analysis for Chinese, I remain doubtful. Because of the context analysis, the problem feels much more like what the GPT-3/4 machines are doing, and that gets into power levels where even their electricity draw is a concern at scale, as mentioned. One could hope, though, that some way of learning and compressing such larger structures might emerge.

Maybe you can see some hint in what I’ve related here, why I suspect doing your own ‘collaborative’ construction, using language interpreting resources that do exist, seeing what happens when Google Translate attempts your phrases and sentences, might make an effective approach. When the result made nonsense, you’d get two clues: the nonsense itself, and what it interpreted out of your mis-constructed character order. I guess I often use such tactics myself, not least in getting the emerging GPT tools to uncover aspects to a problem I can elaborate on or find inspiration in to solve it, where they remain unable to home in this way or be accurate, in much of what they say.

And also as suggested, since you are intent on China, why not look into some tools they are developing there? I found a handful of likely looking online proofing tools – searching for spell checkers isn’t the worst query to use. Depending on your level of progress, you could use a phone as I just did to understand what they are telling you. Collaboration again…it can be a very effective tactic for both us humans and the ‘hand tools’ and larger ones we now design…

It’s possible that another person here on the forum who has quite an involvement with China, @xiamenese, might step in here with some hints, now that I’ve mentioned him.

To close, none of us knows all things, nor can, it’s very clear. It is both polite and useful to consider the demands we make of others – in this case the slow progress of many, many Microsoft persons and their mammoth global computing resources vs. what you could think in respect for a small, dedicated team building a compact resource. Computing is not magic, and it is not ‘held back’ very often by stubbornness; rather by the shocking scale of effort for what those naïve of it often think are just the easy things that could be done. If they were easy, they would be done. Or if sensible, which is another aspect of the question that often comes up here, where Scrivener has done some very intelligent things about matters such as cloud data choices…but I digress, truly.

For reference, my own background includes having once created a form of natural language interpretive system that worked highly accurately, about 20 years before the approach that the PhD crew at Bell Labs, where I did this, thought was required could arrive through advances in technology. It was used very practically around the world for that time in their system. It was in fact built out of a philosophically collaborative arrangement, at that time purely out of instinct, as we discuss here; part of it, almost like a poker game.

But I like my Korean story, and many other experiences that become stories, much better…become is a wonderful and common Korean word, and here again I find a smile…!

Good fortune on your language, and experiences to come, @MilkTeaCat

Clive

MilkTeaCat · March 26, 2023, 7:27am

Thanks for your reply!

I have come to realize that what I wanted was not really a “spell check” but more of an integration of Chinese to be used in Scrivener. As of now, Scrivener is really unfriendly for those who wish to use Chinese in their works.

I agree that it would be hard for L&L to implement Chinese checking. As Chinese is not spelt, any combination of characters could be valid. The red squiggle underline for other languages would not apply. With “造诣”, if it is mistyped as “造艺” or even “造旨”, the “wrong spelling” could be valid in another context, although not in the one you provided. As you described, a grammar checker for Chinese would have to recognize the context a phrase is in and check it accordingly, and so would probably have to incorporate some level of AI. The checker would have to know most common combinations of characters and their contexts as well, which I assume wouldn’t work with Hunspell.

(I looked further and found that “languages supported” referred to localization of Hunspell, and not the dictionaries, which was disappointing.)

I see that if L&L were to incorporate such a feature, they would have to either build it from the ground up or use another implementation, both of which would require a lot of work, especially for a western company. I’m probably not the first person to bring this up, and without more insight from L&L, I think it’s safe to say that this feature isn’t coming.

However, I still suggest that L&L turn off red squiggles for Chinese by default. I also suggest refactoring the word counter, or at least providing an option for the word counter, to count each Chinese character as a word. These changes would make the program a lot more friendly for Chinese users, and I would be really grateful for them!

Thanks for your patience,
MTC

narrsd · March 26, 2023, 4:24pm

Glad you got to some comfort on this, MTC, and some useful pictures forwards.

I think that’s a fine idea, that Scrivener change so as not to try to spell check languages its speller doesn’t deal with.

And it might be technically possible on the primary surface, because languages with different orthography have their own code spaces in the Unicode standard everyone uses now for text.

As always, with some challenging aspects, though, for example that it would need to break up runs of text that are checkable, around insertions from another language that is not.

Really, Hunspell should be taking care of this; just possibly, there’s a way to get it to do so, as this must have come up before.
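
The run-splitting idea above can be sketched roughly as follows. This only illustrates separating text by Unicode range and says nothing about how Scrivener or Hunspell actually tokenize; the name `split_runs`, the labels `"cjk"` and `"other"`, and the particular ranges (Han ideographs, CJK punctuation, fullwidth forms) are assumptions chosen for the example.

```python
import re
from typing import List, Tuple

# Assumption: a "non-checkable" run is made of Han ideographs, CJK symbols
# and punctuation (U+3000-U+303F), or fullwidth forms (U+FF00-U+FFEF).
CJK_RUN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\u3000-\u303f\uff00-\uffef]+")


def split_runs(text: str) -> List[Tuple[str, str]]:
    """Split text into (kind, run) pairs, where kind is 'cjk' or 'other'."""
    runs: List[Tuple[str, str]] = []
    pos = 0
    for match in CJK_RUN.finditer(text):
        if match.start() > pos:
            # Text between CJK runs is what a word-list speller could check.
            runs.append(("other", text[pos:match.start()]))
        runs.append(("cjk", match.group()))
        pos = match.end()
    if pos < len(text):
        runs.append(("other", text[pos:]))
    return runs


for kind, run in split_runs("Scrivener 很好 really"):
    print(kind, repr(run))
```

Only the `"other"` runs would then be handed to the spell checker, so the Chinese insertions never receive underlines.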

Good fortune with the Chinese – it’s a long path surely, but should have many rewards, each to be found in its place along the way…

Best,
Clive

xiamenese · March 26, 2023, 6:03pm

As a Mac user, I can’t really comment on what is possible with the Windows version. However:

If you’re just typing Chinese, simply turn spelling checker off in Options.
If you’re mixing Chinese and English and spell checking the English as you type is important for you…
- If you have paragraphs in English and paragraphs in Chinese, split them off into separate Binder documents, and turn spell checking on when you’re working in English and off when you’re working in Chinese; it’s at Edit → Spell Checking and Grammar and there is a keyboard shortcut (on the Mac, don’t know about Windows).
- If you are mixing Chinese and English within paragraphs, then you’ll have to decide whether to keep spell checking on or off. I personally have it off, and do spell checking in my word processor after compiling.
On character count:
- In File → Options → Editing → Options, there is an entry “Live counts show:” with “Characters” being one of the options, though you can have both Words and Characters ticked.
- Obviously this works best when Chinese and any other language(s) are in separate Binder documents.

Remember, Scrivener works best when the text is split up into small(-ish) Binder documents, though it must be admitted that it’s smoother on the Mac where you can select and edit across Binder documents in a Scrivenings session.

Hope that helps.

Mark