Test Scrivener's New .docx Converters

KB · May 1, 2019, 2:37pm

Those of you who like to read these things may have noticed this hidden away in the release notes for Scrivener 3.1.2:

Scrivener now contains a brand new, native, in-house converter for Word .docx files (affecting import, export and Compile). Because this converter has not yet been tested widely enough, it is turned off by default - by default Scrivener still uses the mature Java-based converters from third-party company Aspose, as it has been doing for several years now. If you would like to test the new converters, open Preferences and then, under Sharing > Conversion, turn off “Use enhanced converters for Microsoft Word and OpenOffice documents”. This will turn off the Aspose Java converters, causing Scrivener to use the new in-house .docx converter instead. (Note that this will also result in poor OpenOffice conversions, however.)

When the Java-based Aspose converters for creating Word and Office files fail, Scrivener now shows a message to warn the user that Apple’s more basic converters will be used instead.

When the Aspose converters fail, Scrivener now falls back on the in-house converter rather than the low-fidelity Apple converter.

Converting to and from Word format is an important feature of Scrivener, but it’s not actually that easy to do. Apple provides some standard converters, but they are terrible - they don’t support images, comments or footnotes, they lose line spacing, and they obliterate other formatting (needless to say, Apple does not use them in its own apps). For the past few years, therefore, Scrivener has used a third-party, Java-based converter (from Aspose) for importing and exporting .docx files. Apple’s own converter was only ever used if the Java converter failed for some reason; you could also force Scrivener to use Apple’s converter instead of the Java-based one by turning off “Use enhanced converters for Microsoft Word and OpenOffice documents” in Scrivener’s Sharing > Conversion preferences.

The Aspose Java-based converters work very well, but as Apple tightens app security, my fear is that it is eventually going to be very difficult to call a Java app from a Cocoa one.

Meanwhile, on iOS, things are even more difficult: Apple’s own semi-functional converter isn’t even available there, and it’s not possible to build in and invoke a separate Java process as you can on the Mac. For our iOS version, then, I built my own .docx converter that did just enough - everything that the iOS version required. This involved getting fairly familiar with the OfficeOpenXML docs and writing a lot of XML code. Fun!

Over the past few months, I have begun to expand on that work by making it cross-platform (Mac and iOS rather than just iOS) and adding to it support for everything that the Mac version of Scrivener needs for its many Compile features. As a result of this work, as of Scrivener 3.1.2, Scrivener now uses my own custom .docx converters when the Java converters either fail or are turned off via the Preferences.

Building these converters to do everything required for the Mac version has been quite complex, and they require a lot more testing. At some point, they may well replace the Java converters for .docx conversion, but they need more real-world use first. I’d therefore be grateful to anyone who is feeling intrepid or experimental enough to use them instead of the Java converters and report any issues with them.

To test out my custom converter, all you need to do is this:

Open Scrivener’s Preferences (Scrivener > Preferences…).
Click on “Sharing” in the Preferences panel toolbar.
Select “Conversion”.
Deselect “Use enhanced converters for Microsoft Word and OpenOffice documents.” (Note that the information under this checkbox is currently misleading when it comes to .docx - it says that if you un-tick it, the macOS standard converters will be used and you may lose formatting. This is no longer true, but I’ll be leaving the text like this to discourage users turning off the setting until I know my own converters are rock solid.)

Important Note: If you turn off this setting, the standard macOS converters will be used for OpenOffice and legacy .doc files, so only do this if you don’t use those formats much.

We have tested lots of documents using our own converter and it should be working well, creating and importing documents just as well as the Java converter. However, it is brand new and there are bound to be issues we have missed, so we’re thankful for any issues you find and report - please post anything you find in a reply to this thread, or email us at mac.support AT literatureandlatte.com. And of course, if you do find any show-stopping bugs, you can easily switch back to using the Java-based converters.

Thanks and all the best,
Keith

xiamenese · May 1, 2019, 5:44pm

Just tried compiling Shirley’s English–Chinese WIP—with which I’m helping—using your converter to .docx on my 13" MBP. 187 pages of text in Chinese, compile time 5 secs, opened automatically in Word (v. 15.13.3 so quite old) in 10 secs. Text is perfect, though it is straightforward text with no images, footnotes, comments, etc.

Great job!

Mark

KB · May 1, 2019, 7:01pm

Thanks Mark! I’m glad it’s working so far.
All the best,
Keith

rbenit68 · May 1, 2019, 7:13pm

Hi,
I got crashes importing the third, fourth and fifth files in this series:

https://calibre-ebook.com/downloads/demos/demo.docx
http://2017.wceam.com/files/2016/07/Springer-eBook-WCEAM2017_template_.docx
http://www.ieee-iemdc-conf.org/Portals/0/IEMDC%202019_Digest%20Template.docx
https://ipmawc.com/wp-content/uploads/2019/04/Paper-Format.docx
http://iglc2019.com/wp-content/uploads/2019/01/IGLC-27-paper-template-2019-new-10-Jan-2019.docx

HTH.

KB · May 2, 2019, 8:06pm

Many thanks for those files, they are really useful. I’m not seeing any crashes with those documents in my current development version, which is good news - I fixed a crash in the importer recently so it looks like it has fixed things for these documents too.

I have spotted a few inconsistencies when importing those documents, though, with fonts and issues in tables. I’m working on that now. My main focus is always on ensuring that export/Compile is 100% solid and reliable, but obviously I want import to work as well as possible too.

All the best,
Keith

nontroppo · May 5, 2019, 4:06am

Hi Keith, gosh writing your own docx converter is a walk on hot coals I’m sure!!!

One point I see is compiler image size conversion. If I have an image at the default sizing:

Screenshot 2019-05-05 at 11.55.47_SMALL.png

In Word the size becomes this:

The “final” size itself seems fine ( (1410px/72dpi) * 2.54 = 49.74cm width ), but the original size isn’t and the scale is set to 200% to compensate?

xiamenese · May 5, 2019, 9:17am

Are you working on a retina screen? When you open an image in Graphic Converter, you are presented with an option of “Full Size” or “Smaller”. “Smaller” is retina sizing, full size is non-Retina and reports at 200%.

Could that be involved?

Mark

nontroppo · May 5, 2019, 3:13pm

Yes, I have a retina screen, though I am confused why you would scale an image to 200% on a retina screen (it is already downsampling automagically). This default result is a massive image that does not fit the Word page width and gets cut off. Actually I tried the Aspose docx engine option in the compiler and it generates the same scaling, so Keith is probably just trying to stay compatible with the Aspose engine's behaviour.

Pandoc docx writer has much more useful behaviour, in that it rescales images that do not specify dimensions so that they fit the width of the Word page as a default. This is also how the Scrivener editor visualises images in page view. Why wouldn’t you want this as the default behaviour?

The performance improvement between Keith’s converter and the Aspose one is striking!

xiamenese · May 5, 2019, 4:00pm

I always load images in the retina-aware mode, though sometimes I then magnify to 200% for precision manipulation.

Mmm. I have tried compiling a project which includes images to .docx. My version of Word, at least, chokes on images which have spaces in their names, for which I understand Keith has already set up a workaround for the next update. Apart from that, interestingly, only one image (downloaded from the web) displayed improperly, not reducing to fit within the right margin; its dimensions were easy enough to correct, even for someone who loathes Word and doesn’t understand its arcana! So I just wondered if the problem you identify might not be more on Word’s part. Even if it is, I would trust Keith to find a workaround.

Agree 100%

Mark

KB · May 7, 2019, 8:00am

nontroppo - I can’t reproduce this on my end. Are you using any of the resizing or scaling options in the Compile options, for instance the option to scale the image to the page width? Could you provide a sample project? Thanks.

nontroppo · May 8, 2019, 2:10am

Actually I didn’t see this compile setting before, so it was turned off… So we can expect that the image is not scaled to the page width, but why is it at 200% rather than 100%? Anyway, the setting solves the real issue, but if you want to scratch an itch on the 200%, I include a test project.

Test.scriv.zip (431 KB)

KB · May 8, 2019, 2:42pm

Thanks for the test file. It seems that the difference is down to using a linked image that is then embedded during the Compile process. However, I’m completely baffled as to how it is happening - I think I have to call shenanigans on Word for this one. Word must be examining the images in some odd ways and finding differences in their dimensions that aren’t obvious. I come to this conclusion because I tried embedding an image and also inserting it as a linked image, and the data for both came out exactly the same in Word, and yet Word reported it as 200%.

You can see that this doesn’t seem to be Scrivener’s doing yourself:

Generate the Word .docx file.
Change the extension .docx to .zip in the Finder.
Unzip the file using something like Stuffit Expander (Apple’s own archive utility most likely won’t unzip it properly).

Now drill down, open word/document.xml and search for the <w:drawing> data. Check out the “wp:extent” element, which determines the size of the image in Word. For your image, this is given as:

<wp:extent cx="28905200" cy="15697200"/>

The sizes here are given in EMUs, which are 12700 to a point.

Now open FleschFig1.png from the word/media folder and examine its dimensions in e.g. Preview. You’ll see that it is 72dpi with a size of 2276 x 1236.

Well:
2276 x 12700 = 28905200
1236 x 12700 = 15697200

So, we have a 72dpi file and its size is set correctly. So why is it reporting it as scaled to 200%?
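The manual check above can also be scripted. The following is a minimal sketch, not anything Scrivener itself does: it assumes the exported file is a standard .docx (a zip archive containing word/document.xml) and that image sizes are stored in wp:extent elements as described. The function name `image_extents_in_points` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the wordprocessingDrawing elements (wp:inline, wp:extent, ...).
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
EMU_PER_POINT = 12700  # 914400 EMUs per inch / 72 points per inch


def image_extents_in_points(docx):
    """Return (width, height) in points for every wp:extent in the document.

    `docx` may be a path or a binary file-like object; a .docx is a zip archive.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    sizes = []
    for extent in root.iter(f"{{{WP_NS}}}extent"):
        cx = int(extent.get("cx"))  # width in EMUs
        cy = int(extent.get("cy"))  # height in EMUs
        sizes.append((cx / EMU_PER_POINT, cy / EMU_PER_POINT))
    return sizes
```

Calling `image_extents_in_points("Exported.docx")` (a hypothetical file name) returns one (width, height) pair per image, which can then be compared against the pixel dimensions of the files in word/media. This also avoids the renaming and unzipping steps, since Python's zipfile module reads the .docx directly.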

It gets weirder when you try the same with the attached project, which has the same image embedded in different ways. If you examine the images and <w:drawing> info in the unzipped exported Word file, you’ll see that the images are both 72dpi with the same dimensions, the size given in <w:drawing> is the same, and yet Word reports one as being 200% and the other as 100% (but they look the same in the editor).

Bizarre!
ImgSizeTest.zip (122 KB)

nontroppo · May 9, 2019, 12:30am

Yes, something strange. There are differences in the embedded/linked images though (unzipped word/media folder):

Screenshot 2019-05-09 at 08.07.05.png

The linked image (ZXScrivener-1.jpg) gets compressed while the embedded one isn’t. Using exiftool (install with homebrew, or sno.phy.queensu.ca/~phil/ex … index.html), there are differences in the metadata for the two files; the most significant is that Resolution Unit is set to None in the compressed file, but inches in the original:

File Size : 21 kB
File Name : ZXScrivener-1.jpg
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
JFIF Version : 1.01
Resolution Unit : None
X Resolution : 72
Y Resolution : 72
Exif Byte Order : Big-endian (Motorola, MM)
Create Date : 2010:02:02 00:36:20
Color Space : sRGB
Exif Image Width : 580
Exif Image Height : 484
XMP Toolkit : XMP Core 5.4.0

File Size : 60 kB
File Name : ZXScrivener.jpg
File Type : JPEG
File Type Extension : jpg
MIME Type : image/jpeg
JFIF Version : 1.02
Exif Byte Order : Big-endian (Motorola, MM)
Orientation : Horizontal (normal)
X Resolution : 72
Y Resolution : 72
Resolution Unit : inches
Software : Adobe Photoshop CS4 Macintosh
Modify Date : 2010:02:02 00:36:20
Color Space : sRGB
Exif Image Width : 580
Exif Image Height : 484
Compression : JPEG (old-style)
Thumbnail Offset : 332
Thumbnail Length : 5977
IPTC Digest : 00000000000000000000000000000000
Displayed Units X : inches
Displayed Units Y : inches

Lots of other colourspace metadata is also removed in the compressed one, but I suspect the resolution unit difference is the more likely culprit. Whether that is what triggers Word to show 200%, or whether it even makes sense for Word to do so, is another matter!
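For anyone who wants to check the resolution unit without exiftool, here is a rough sketch that reads it straight from the JFIF header. It assumes the file starts with a JFIF APP0 segment immediately after the start-of-image marker, and it does not look at EXIF data, which can carry its own, separate resolution unit. The function name `jfif_density` is invented for this example.

```python
import struct

# Density unit codes defined for the JFIF APP0 segment.
JFIF_UNITS = {0: "none", 1: "inches", 2: "cm"}


def jfif_density(data):
    """Return (unit, x_density, y_density) from a JPEG's JFIF header.

    Returns None if the data does not begin with SOI + a JFIF APP0 segment.
    """
    # Layout: SOI (FFD8), APP0 marker (FFE0), 2-byte length, "JFIF\0",
    # 2-byte version, 1-byte unit, 2-byte X density, 2-byte Y density.
    if len(data) < 18 or data[:4] != b"\xff\xd8\xff\xe0":
        return None
    if data[6:11] != b"JFIF\x00":
        return None
    units, x_density, y_density = struct.unpack(">BHH", data[13:18])
    return JFIF_UNITS.get(units, "unknown"), x_density, y_density
```

To use it, pass in the raw bytes of a file, e.g. `jfif_density(open("ZXScrivener-1.jpg", "rb").read())`. A unit of "none" means the two density values only describe the pixel aspect ratio rather than a real dpi.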

Aspose compressed BOTH embedded/linked images, and it results in significantly worse image quality (look at the JPG compression artifacts):

I don’t really understand why Scrivener compresses the linked image but not the embedded one, or why Aspose compresses both, if there is no resizing in the editor and no rescaling in the compiler?

KB · May 9, 2019, 9:39am

I’m not sure why Aspose compresses both - that must be down to Aspose as the same text is passed through to both converters. I can answer why this happens in Scrivener, though - it’s down to a technicality. To embed the linked images into the text before export, I have to load their data and create an embedded image wrapper for them. If the image has been resized, I have to resize the actual image (changing the resolution) in order for it to work (it’s the only way to support image resizing). At this point, I have to create a new version of the image data.

But here’s the rub: there is no way (at least no way I can find) in Cocoa of retrieving the original compression from a JPEG file. So when new JPEG data is created, I have to apply an arbitrary compression factor. In Scrivener, you can set this via the Sharing > Export preferences.

There is one hitch in the way I’m doing things at the moment: the code that embeds linked images always generates the JPEG data from the resizing code, even if there is no need to resize (i.e. it just passes in a scale factor of 1.0). I’ve fixed this for the next update so that if there is no need to alter the size of the image, the data used by the embedded image will just be that of the original JPEG. This solves the problem of the 200% in your example file. As soon as an image is resized, though, Scrivener has to use the JPEG compression setting from the Preferences.

Note: this only applies to JPEG files, of course.
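The fix described above boils down to a simple branch. The sketch below is only an illustration in Python, not Scrivener's actual (Cocoa) code: `embedded_jpeg_data` and the `reencode` callable are hypothetical names standing in for whatever the real resizing and encoding routines are.

```python
def embedded_jpeg_data(original_data, scale, reencode, quality):
    """Pick the bytes to embed for a linked JPEG.

    `reencode(data, scale, quality)` is a hypothetical callable that resizes
    the image and returns freshly encoded JPEG bytes.
    """
    if scale == 1.0:
        # No resize needed: reuse the original bytes untouched, so there is
        # no extra compression pass and the file's metadata is preserved.
        return original_data
    # A resize forces new JPEG data, which needs an explicit quality setting
    # because the original compression level cannot be recovered.
    return reencode(original_data, scale, quality)
```

The point of the early return is that an unscaled image never goes through the encoder at all, so its resolution unit and other metadata stay exactly as they were in the source file.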

All the best,
Keith

nontroppo · May 9, 2019, 11:21am

OK makes sense! Exiftool doesn’t show what the JPEG compression is either, and from what I gleaned online, storing this info is not part of the JFIF spec at all. It probably doesn’t matter too much anyway as JPEG is lossy, so whenever a 75% JPEG is saved again at 75%, it will gradually become worse AFAIK.

KB · May 9, 2019, 1:14pm

That’s a good point. What’s frustrating, though, is that if you save freshly-created JPG data in Cocoa with no compression, it ends up bigger - a 60kb file altered only to have its dpi changed becomes 120kb. Otherwise I could save without compression, but as it is I have to put an option in the Preferences.

jpkell · June 7, 2019, 4:42pm

Hi Keith,

I am a longtime Scrivener user, who is now writing a scholarly book with it. The book has many equations and symbols, for which I used MathType integrated with Scrivener. But I have encountered a huge problem that I think your new converters could help with (and from reading the forums, I’m not the only one who could use this).

When one’s manuscript contains equations and symbols, it turns out that publishers want either a LaTeX document or a Word document with equations and symbols editable with Word’s Equation Editor. But when one compiles a Scrivener document that includes MathType equations and symbols to docx, the equations and symbols become images that are not editable with Equation Editor.

After spending a day yesterday trying to figure out a workaround, the best I can think of is to (re)write the book in Markdown with Scrivener, inserting where necessary the LaTeX equation/symbol code that I can copy from MathType. I confirmed that I can use pandoc to convert a Markdown file containing LaTeX equation code to a docx file containing an equation that is editable in Word’s Equation Editor. So I know it’s possible to convert the LaTeX equation syntax to Word’s Equation Editor. What would be AMAZING-AND-I-WOULD-BE-FOREVER-GRATEFUL is if I could write in Scrivener as usual, with all of Scrivener’s usual rich text, but instead of inserting MathType equations as images, I would insert them as LaTeX code, and then your tools would use Pandoc (or equivalent) to render the equations as editable with Equation Editor when compiling to docx from within Scrivener.
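For reference, the pandoc step in that workaround can be scripted roughly as follows. This is a sketch under assumptions: pandoc must be installed and on the PATH, the equations must be written as TeX math between dollar signs in the Markdown source, and the file names and function names here are placeholders.

```python
import shutil
import subprocess


def pandoc_docx_command(markdown_path, docx_path):
    """Build the pandoc command line for a Markdown-to-.docx conversion."""
    return ["pandoc", "-f", "markdown", "-o", docx_path, markdown_path]


def convert(markdown_path, docx_path):
    """Run pandoc if it is installed; raise if it is missing or fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_docx_command(markdown_path, docx_path), check=True)
```

For example, `convert("book.md", "book.docx")` would run `pandoc -f markdown -o book.docx book.md`. The output format is inferred from the .docx extension.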

Is there any chance at all this could be included in your new docx converter tools?? Thanks!

nontroppo · June 24, 2019, 9:55am

@KB — I think I’ve found a bug with Style export. Please download the attached Project from this post: https://forum.literatureandlatte.com/t/compiling-to-word-with-styles/39428/2

With the Aspose converter, I get a full outline with Headings 1–3 (as in the screenshot in that post). But with the new converter only the first “Heading 1” Part is exported, the other parts are just “Normal” styles, thus no outlines:

nontroppo · July 1, 2019, 2:42am

FYI: the Heading 1 export bug detailed in the previous post persists in V3.1.3…

KB · July 1, 2019, 1:12pm

Works fine for me. I downloaded and exported your project using my converters in 3.1.3, and I see the exact outline from your screenshot. I’ve attached the exported Word file.
Heading Test.zip (7.8 KB)