
How can I handle special characters when importing Word documents?

jesse08 July 11, 2017

We have a set of old Microsoft Word documents (primarily .doc) that we want to import into Confluence (5.9.12).  Most content imports OK, but Word's "special characters" that were inserted as symbols do not.  For example, a μ (mu) symbol from these documents shows up as an empty box in the Confluence web interface.  I can import a test .doc file containing both a proper Unicode μ and the non-functional symbol from the Word documents: the Unicode character displays correctly, while Word's "symbol" does not.  So the problem seems to lie in how Microsoft Word stores these special characters.  Does anyone know of a way for Confluence to handle this or, failing that, a way we could convert these goofy characters into their Unicode equivalents before importing?

A bit more gory detail on my troubleshooting:

If I copy and paste the box character into a text file and check the contents of that one-character file byte for byte, I see 0xef81ad, which matches what I get if I copy the character directly from the Word document.  I can also do a manual search-and-replace in the Confluence web interface for that specific character (literally pasting in the box symbol) and put μ in its place, and the replacement leaves alone the other identical-looking but different special characters (like the one for a degree symbol).  So it does seem that Confluence has all the information after importing; the display is just garbled since it doesn't know that 0xef81ad should be shown as a mu character.  I'm playing around with the XML-RPC API to see if I can do a batch search-and-replace, but I still need to figure out all the possible characters we'd run into and make sure I can actually get at that weird text via the API.
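For anyone wanting to reproduce the byte check and take stock of which odd characters a page contains, here is a minimal Python sketch. It assumes the page text is already available as a string; the `sample` text and the U+F0B0 character in it are made up for illustration, and the helper names are hypothetical rather than part of any Confluence API.

```python
from collections import Counter

# Private use area of the Basic Multilingual Plane: U+E000 through U+F8FF.
PUA_START, PUA_END = 0xE000, 0xF8FF


def is_private_use(ch):
    """Return True if the single character ch lies in the BMP private use area."""
    return PUA_START <= ord(ch) <= PUA_END


def inventory_private_use(text):
    """Count each private-use character in text, keyed by its 'U+XXXX' label."""
    return Counter(f"U+{ord(ch):04X}" for ch in text if is_private_use(ch))


# The three bytes seen in the one-character file decode to one code point.
box = bytes.fromhex("ef81ad").decode("utf-8")
print(f"U+{ord(box):04X}", is_private_use(box))

# Hypothetical imported text; U+F0B0 is only a stand-in for another symbol.
sample = "5 \uf06dm at 20 \uf0b0C, then 7 \uf06dm"
print(inventory_private_use(sample))
```

Running the inventory over every imported page would give the list of characters that still need a replacement.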

Thanks in advance for any ideas,

Jesse

2 answers

0 votes
jesse08 July 13, 2017

My support request ticket led to a bug ticket, so it looks like this is a limitation of the import process after all, regardless of the database used:

https://jira.atlassian.com/browse/CONFSERVER-52857

So I suppose the answer to my question, for the time being, is to perform a search-and-replace on any strange characters, at least until issue 52857 is fixed somehow.  Thanks for your help!

AnnWorley
Atlassian Team
Atlassian Team members are employees working across the company in a wide variety of roles.
July 13, 2017

Thank you so much for circling back to let the Community know the outcome!

0 votes
AnnWorley
Atlassian Team
July 11, 2017

Are you using MySQL as the underlying database for your Confluence instance? If so, you could be impacted by: MySQL databases incapable of handling 4-byte UTF-8 characters. Confluence should handle this gracefully.

It looks like some folks experienced Word import issues due to database collation as well:

 

jesse08 July 11, 2017

Thanks Ann.  We're on MariaDB but configured for UTF8.  These symbols look like three bytes and I don't have any errors matching "Incorrect string" in the logs, so I don't think we're suffering from that problem.

Everything's running smoothly right up until the special characters are presented to the web browser, which then has no knowledge of how to render them.  When I looked into it more just now, I found that Word is apparently using one of the "private use areas" in Unicode, which are by definition left undefined.  Here's the character Word uses for its mu symbol:

https://unicode-table.com/en/F06D/

So it seems like any attempt to import these special characters would need to understand Microsoft's custom encoding to handle them properly.  Any chance Confluence's import feature can do that?  Or are we stuck with some kind of search-and-replace to get "real Unicode"?  Thanks!
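A rough sketch of what the search-and-replace route could look like in Python, in case it helps. Only the mu mapping (U+F06D) is confirmed in this thread. The other entries assume that Word stores Symbol-font characters at U+F000 plus the font's byte code and that the font follows the usual Symbol layout, so they should be checked against the actual documents. The function names are hypothetical.

```python
# Assumption: Symbol-font characters are stored at U+F000 + the font's byte code.
SYMBOL_PUA_BASE = 0xF000

# Symbol-font byte code -> standard Unicode character.
# Only 0x6D (mu, seen as U+F06D) is confirmed by the thread; the rest are
# assumptions based on the common Symbol font layout.
SYMBOL_TO_UNICODE = {
    0x61: "\u03b1",  # alpha (assumed)
    0x62: "\u03b2",  # beta (assumed)
    0x6D: "\u03bc",  # mu (confirmed: U+F06D)
    0xB0: "\u00b0",  # degree sign (assumed)
    0xB1: "\u00b1",  # plus-minus sign (assumed)
}

# Translation table for str.translate: private-use code point -> replacement.
TRANSLATION = {SYMBOL_PUA_BASE + code: char for code, char in SYMBOL_TO_UNICODE.items()}


def fix_symbol_chars(text):
    """Replace known private-use Symbol characters with Unicode equivalents."""
    return text.translate(TRANSLATION)


def unmapped_private_use(text):
    """Return labels of private-use characters that have no mapping yet."""
    return {
        f"U+{ord(ch):04X}"
        for ch in text
        if 0xE000 <= ord(ch) <= 0xF8FF and ord(ch) not in TRANSLATION
    }


print(fix_symbol_chars("10 \uf06dm"))
print(unmapped_private_use("\uf06d\uf0de"))
```

Characters reported by `unmapped_private_use` are left untouched by `fix_symbol_chars`, so the table can be extended step by step as new symbols turn up.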

I also put in a ticket -- CSP-208778 -- before I saw your reply here, thinking this was more likely something Atlassian could help us with directly, but thanks for the quick response on this side.
