How can I handle special characters when importing Word documents?

We have a set of old Microsoft Word documents (primarily .doc) that we want to import into Confluence (5.9.12).  Most content imports OK, but Word's "special characters" that were inserted as symbols do not.  For example, a μ (mu) symbol from these documents shows up as  in the Confluence web interface.  I can import a test .doc file with both a proper unicode μ and the non-functional  from the word documents.  The unicode works where Word's "symbol" doesn't.  So, it seems to be a problem of handling whatever Microsoft Word is doing when it stores these special characters.  Does anyone know of a way that Confluence could handle this, or failing that, that we could convert these goofy characters into their unicode equivalents before importing?

A bit more gory detail on my troubleshooting:

If I copy and paste the  into a text file and check the contents on that one-character file byte-for-byte, I see 0xef81ad, which matches what I get if I copy the character directly from the Word document.  I can also do a manual search-and-replace in the Confluence web interface for that specific  (literally pasting in the box symbol) and put μ in its place, and the replacement leaves alone the other identical-looking but different special characters (like for a degree symbol).  So it does seem that Confluence has all the information after importing, the display is just garbled since it doesn't know that 0xef81ad should be shown as a mu character.  I'm playing around with the XML-RPC API to see if I can do a batch search-and-replace, but then I still need to figure out all the possible characters we'd run into and make sure I can actually get at that weird text via the API.
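For anyone wanting to reproduce the byte check and get a list of "all the possible characters we'd run into", here is a minimal Python sketch. It assumes the page text has already been exported to a string by some means; the `sample` string and the `pua_inventory` helper are made up for illustration, and U+F0B0 for the degree symbol is an assumption rather than something confirmed from the documents.

```python
import unicodedata
from collections import Counter

# The three bytes seen in the one-character text file.
raw = bytes.fromhex("ef81ad")
ch = raw.decode("utf-8")
# Unicode general category "Co" means "private use".
print(f"U+{ord(ch):04X}", unicodedata.category(ch))


def pua_inventory(text):
    """Count every private-use character in text, keyed by 'U+XXXX'."""
    return Counter(
        f"U+{ord(c):04X}" for c in text if unicodedata.category(c) == "Co"
    )


# Hypothetical sample text; PUA characters are written as escapes so they
# stay visible in the source.
sample = "5 \uf06dm at 20 \uf0b0C, then 7 \uf06dm"
print(pua_inventory(sample))
```

Running the inventory over every imported page would give the full set of code points that need a replacement, before any search-and-replace is attempted.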

Thanks in advance for any ideas,

Jesse

2 answers

Ann Worley Atlassian Team Jul 11, 2017

Are you using MySQL as the underlying database for your Confluence instance? If so, you could be impacted by: MySQL databases incapable of handling 4-byte UTF-8 characters. Confluence should handle this gracefully

It looks like some folks experienced Word import issues due to database collation as well:

 

Thanks Ann.  We're on MariaDB but configured for UTF8.  These symbols look like three bytes and I don't have any errors matching "Incorrect string" in the logs, so I don't think it looks like we're suffering from that problem.
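A quick way to confirm the "three bytes" observation, as a sketch in Python: only code points above U+FFFF need four bytes in UTF-8, and the character in question sits below that. The emoji is just an arbitrary four-byte example for comparison.

```python
# The character copied from the Word document, written as an escape.
ch = "\uf06d"
encoded = ch.encode("utf-8")
print(encoded.hex(), len(encoded))

# UTF-8 needs four bytes only for code points beyond the Basic
# Multilingual Plane, i.e. above U+FFFF.
needs_4_bytes = ord(ch) > 0xFFFF
print(needs_4_bytes)

# For comparison, an arbitrary code point outside the BMP.
four_byte_example = "\U0001F600".encode("utf-8")
print(len(four_byte_example))
```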

Everything's running smoothly right up until the special characters are presented to the web browser, but then it has no knowledge of how to render them.  When I looked into it more just now, I found that Word is apparently using one of the "private use areas" in Unicode, which are by definition left undefined.  Here's the character Word uses for its mu symbol:

https://unicode-table.com/en/F06D/

So it seems like any attempt to import these special characters would need to understand what Microsoft's custom encoding is to handle them properly.  Any chance Confluence's import feature can do that?  Or are we stuck with some kind of search-and-replace to get "real unicode"?  Thanks!
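If it does come down to search-and-replace, the conversion itself could look something like the Python sketch below. It assumes the characters were inserted from the Symbol font, where Word typically stores the glyph at U+F000 plus the font's own byte code (0x6D for mu, 0xB0 for degree). Characters inserted from other fonts such as Wingdings would share the same code points with different meanings, so the table has to be checked against the actual documents. The `SYMBOL_PUA_TO_UNICODE` table and `fix_symbols` function are illustrative names, not part of any Confluence API.

```python
# Partial, assumed mapping from Symbol-font private-use code points to
# standard Unicode. Extend it with whatever an inventory of the real
# documents turns up.
SYMBOL_PUA_TO_UNICODE = {
    0xF06D: "\u03bc",  # Symbol 0x6D -> GREEK SMALL LETTER MU
    0xF0B0: "\u00b0",  # Symbol 0xB0 -> DEGREE SIGN
}


def fix_symbols(text):
    """Replace known private-use characters; leave everything else as is."""
    return text.translate(SYMBOL_PUA_TO_UNICODE)


before = "5 \uf06dm at 20 \uf0b0C"
after = fix_symbols(before)
print(after)
```

Using `str.translate` keeps unmapped private-use characters untouched, so anything missing from the table is still there to be found later instead of being silently dropped.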

I also put in a ticket -- CSP-208778 -- before I saw your reply here, thinking this was more likely something Atlassian could help us with directly, but thanks for the quick response on this side.


My support request ticket led to a bug ticket, so it looks like this is a limitation of the import process after all, whatever the database used:

https://jira.atlassian.com/browse/CONFSERVER-52857

So I suppose the answer to my question for the time being is to perform a search-and-replace on any strange characters, at least until CONFSERVER-52857 is fixed somehow.  Thanks for your help!

Ann Worley Atlassian Team Jul 13, 2017

Thank you so much for circling back to let the Community know the outcome!
