My goal is to use the Confluence API to get the content of a page, parse it, edit it, and update that same page with the edited content.
At first, I assumed Confluence's storage format was HTML. Based on that, my original plan was to use Python's BeautifulSoup module to parse and edit the content once I retrieved it from the Confluence API. I now know that the Confluence page storage format is only "XHTML-based". I've tried parsing it with several BeautifulSoup parsers (lxml, xml, html.parser), but they all trip over the non-standard macro elements like this:
<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="283daa7d-46af-4d6d-a177-a00b4a2bc342"><ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body></ac:structured-macro>
Is there a preferred method for parsing this?
This might not be the solution you are looking for; however, if you are interested in writing Confluence wiki text to docx format (while preserving the wiki formatting), you can try the jirawiki2docx Python library.
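On the parsing question itself, one approach that may work is to treat the storage format as an XML fragment: wrap it in a temporary root element that declares the `ac:` and `ri:` prefixes, parse it with the standard library's `xml.etree.ElementTree`, edit the tree, and strip the wrapper again before sending it back. The sketch below is only that, a sketch. The namespace URIs are placeholders I made up (the fragment does not declare any), and `parse_storage` / `serialize_storage` are hypothetical helper names. Two caveats: ElementTree reads CDATA sections as plain text and writes them back as escaped text rather than as `<![CDATA[...]]>`, and named HTML entities such as `&nbsp;`, if your pages contain them, are not defined in plain XML and would need separate handling. Whether Confluence accepts the re-serialized body unchanged is something to verify against your instance.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): the storage format uses the
# ac: and ri: prefixes without declaring them, so any stable URI works here.
NAMESPACES = {
    "ac": "urn:confluence:ac",
    "ri": "urn:confluence:ri",
}

# Make the serializer write ac:/ri: instead of generated ns0:/ns1: prefixes.
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)


def parse_storage(fragment: str) -> ET.Element:
    """Wrap a storage-format fragment in a root that declares the prefixes."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in NAMESPACES.items())
    return ET.fromstring(f"<root {decls}>{fragment}</root>")


def serialize_storage(root: ET.Element) -> str:
    """Serialize the tree and strip the temporary wrapper element."""
    xml = ET.tostring(root, encoding="unicode")
    if xml.endswith("/>") and "</root>" not in xml:
        return ""  # the wrapper was empty
    start = xml.index(">") + 1      # end of the <root ...> opening tag
    end = xml.rindex("</root>")     # start of the closing wrapper tag
    return xml[start:end]


# Example input modelled on the macro from the question (macro-id omitted).
fragment = (
    r'<p>Intro</p>'
    r'<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1">'
    r'<ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body>'
    r'</ac:structured-macro>'
)

root = parse_storage(fragment)

# Read the macro name and its plain-text body.
ac = NAMESPACES["ac"]
macro = root.find("ac:structured-macro", NAMESPACES)
macro_name = macro.get(f"{{{ac}}}name")
macro_body = macro.find("ac:plain-text-body", NAMESPACES).text

# Edit ordinary XHTML content, then serialize without the wrapper.
root.find("p").text = "Edited intro"
result = serialize_storage(root)
print(macro_name, macro_body)
print(result)
```

If keeping the CDATA wrappers turns out to matter, lxml's `etree.XMLParser(strip_cdata=False)` is worth a look, though I have not checked how it round-trips this particular markup.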