We have a set of old Microsoft Word documents (primarily .doc) that we want to import into Confluence (5.9.12). Most content imports OK, but Word's "special characters" that were inserted as symbols do not. For example, a μ (mu) symbol from these documents shows up as an empty box in the Confluence web interface. I can import a test .doc file containing both a proper Unicode μ and the non-functional symbol from the Word documents. The Unicode character works where Word's "symbol" doesn't. So it seems to be a problem of handling whatever Microsoft Word does when it stores these special characters. Does anyone know of a way that Confluence could handle this or, failing that, a way we could convert these goofy characters into their Unicode equivalents before importing?
A bit more gory detail on my troubleshooting:
If I copy and paste the box symbol into a text file and check the contents of that one-character file byte for byte, I see 0xef81ad, which matches what I get if I copy the character directly from the Word document. I can also do a manual search-and-replace in the Confluence web interface for that specific character (literally pasting in the box symbol) and put μ in its place, and the replacement leaves alone the other identical-looking but different special characters (like the one for a degree symbol). So it does seem that Confluence has all the information after importing; the display is just garbled since it doesn't know that 0xef81ad should be shown as a mu character. I'm playing around with the XML-RPC API to see if I can do a batch search-and-replace, but then I still need to figure out all the possible characters we'd run into and make sure I can actually get at that weird text via the API.
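For anyone who wants to check what those three bytes actually are, here is a minimal Python sketch. It only assumes the bytes are UTF-8 (which the 0xef81ad pattern is consistent with); the variable names are just for illustration.

```python
import unicodedata

# The three bytes reported above for Word's "mu" symbol.
raw = bytes.fromhex("ef81ad")

# Decode them as UTF-8 to get a single character.
ch = raw.decode("utf-8")
code_point = ord(ch)

# Unicode general category "Co" means "Other, private use".
category = unicodedata.category(ch)

print(f"U+{code_point:04X}", category)

# The Basic Multilingual Plane Private Use Area runs from U+E000 to U+F8FF.
print(0xE000 <= code_point <= 0xF8FF)
```

The bytes decode to a single code point, U+F06D, which is not the real mu (U+03BC).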
Thanks in advance for any ideas,
Are you using MySQL as the underlying database for your Confluence instance? If so, you could be impacted by: MySQL databases incapable of handling 4byte UTF-8 Characters. Confluence should handle this gracefully
It looks like some folks experienced Word import issues due to database collation as well:
Thanks Ann. We're on MariaDB but configured for UTF8. These symbols look like three bytes and I don't have any errors matching "Incorrect string" in the logs, so I don't think it looks like we're suffering from that problem.
Everything's running smoothly right up until the special characters are presented to the web browser, but then the browser has no knowledge of how to render them. When I looked into it more just now, I found that Word is apparently using one of the "private use areas" in Unicode, which are by definition left undefined. Here's that character being used by Word's mu symbol:
So it seems like any attempt to import these special characters would need to understand what Microsoft's custom encoding is to handle them properly. Any chance Confluence's import feature can do that? Or are we stuck with some kind of search-and-replace to get "real unicode"? Thanks!
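A search-and-replace of this kind could be sketched in Python roughly as follows. The table and the function name are hypothetical: only the U+F06D to μ entry is backed by the bytes reported in this thread, and the degree entry is an assumption based on the idea that Word stores Symbol-font glyphs at U+F000 plus the font's byte value (0x6D is "m", which Symbol draws as mu). Any real table would need checking against the actual documents.

```python
# Hypothetical lookup table from Word's private-use code points to real Unicode.
# Only the first entry is confirmed by the bytes (0xef81ad) seen in this thread.
SYMBOL_PUA_TO_UNICODE = {
    0xF06D: "\u03bc",  # mu (confirmed above)
    0xF0B0: "\u00b0",  # degree sign (assumed, not verified against the documents)
}


def replace_symbol_pua(text: str) -> str:
    """Swap known private-use characters for their Unicode equivalents.

    Characters that are not in the table are left untouched, so unknown
    symbols stay visible as boxes instead of being silently changed.
    """
    return text.translate(SYMBOL_PUA_TO_UNICODE)


if __name__ == "__main__":
    sample = "5 \uf06dm at 20 \uf0b0C"
    print(replace_symbol_pua(sample))
```

Leaving unmapped characters alone is deliberate here: it keeps the replacement as selective as the manual one, which did not touch the other identical-looking symbols.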
I also put in a ticket -- CSP-208778 -- before I saw your reply here, thinking this was more likely something Atlassian could help us with directly, but thanks for the quick response on this side.
My support request ticket led to a bug ticket, so it looks like this is a limitation of the import process after all, whatever the database used:
So I suppose the answer to my question for the time being is, perform a search-and-replace on any strange characters, at least until issue 52857 is fixed somehow. Thanks for your help!
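To work out which "strange characters" need a replacement in the first place, something like the sketch below could tally every private-use character in a page body. It assumes the page text has already been fetched as a Python string by whatever means; the function name is hypothetical, and only the Basic Multilingual Plane private use range (U+E000 to U+F8FF) is checked.

```python
from collections import Counter


def inventory_private_use(text: str) -> Counter:
    """Count each private-use character in text, keyed by its U+XXXX label."""
    return Counter(
        f"U+{ord(c):04X}"
        for c in text
        if 0xE000 <= ord(c) <= 0xF8FF  # BMP Private Use Area only
    )


if __name__ == "__main__":
    body = "5 \uf06dm, 7 \uf06dm, 20 \uf0b0C"
    for label, count in inventory_private_use(body).most_common():
        print(label, count)
```

Running this over all imported pages would give a list of code points to map by hand before doing the batch replacement.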