
Is there a Python library that can parse Confluence page content?

Woodson Miles December 3, 2021

My goal is to use the Confluence API to get the content of a page, parse it, edit it, and update that same page with the edited content.

At first, I assumed Confluence's storage format was HTML. Based on that, my original plan was to use Python's BeautifulSoup module to parse and edit the content once I retrieved it from the Confluence API. I now know that the Confluence page storage format is "XHTML-based". I've tried to parse it with various BeautifulSoup parsers (lxml, xml, html.parser) but they all get caught on the standard-breaking macro elements like this:

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="283daa7d-46af-4d6d-a177-a00b4a2bc342"><ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body></ac:structured-macro>

 

Is there a preferred method for parsing this?

4 answers

2 votes
Stephen M June 8, 2022

I am having the same issue! If you found a solution could you share? Thanks!

Woodson Miles June 13, 2022

I never found a parser that worked perfectly for Confluence. Ultimately, I was forced to edit Confluence pages using regular expressions.

Steve Sadler September 15, 2022

I have had the same issue. Ultimately, I have resorted to using an HTML parser (where that works), an XML parser (where that works), and regex for everything else.

John Larring June 26, 2023

Same issue here. Need to find a way to parse Confluence XHTML incl. macro notation.

Evgeny November 30, 2023

I am using regex to modify Confluence pages.
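
For anyone wondering what the regex route might look like, here is a minimal sketch, assuming the text you want to change lives inside the CDATA section of an ac:plain-text-body element like the one in the question. The edit_plain_text_bodies helper and the sample string are hypothetical; the pattern only targets that one element shape and leaves the rest of the page untouched.

```python
import re

# Matches the CDATA payload of every <ac:plain-text-body> element.
# Group 1 and group 3 are the surrounding markup; group 2 is the text to edit.
CDATA_BODY = re.compile(
    r"(<ac:plain-text-body><!\[CDATA\[)(.*?)(\]\]></ac:plain-text-body>)",
    re.DOTALL,
)


def edit_plain_text_bodies(storage, transform):
    """Apply transform() to each plain-text body, keeping the markup as-is."""
    return CDATA_BODY.sub(
        lambda m: m.group(1) + transform(m.group(2)) + m.group(3),
        storage,
    )


# Hypothetical sample page content.
storage = (
    '<ac:structured-macro ac:name="unmigrated-wiki-markup">'
    "<ac:plain-text-body><![CDATA[hello]]></ac:plain-text-body>"
    "</ac:structured-macro>"
)
edited = edit_plain_text_bodies(storage, str.upper)
```

The transform must not produce the sequence ]]> since that would end the CDATA section early.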

2 votes
Bill Bailey
December 3, 2021
Woodson Miles December 6, 2021

Hey Bill, thanks for your response.

Alas, while these 2 libraries are helpful wrappers for the Confluence API, as far as I can tell, neither has the ability to parse the XHTML that the API returns as the content of a page.

0 votes
Ifeoluwa Akande July 9, 2023

This might not be the solution you desire; however, if you are interested in converting Confluence wiki text to docx format (while preserving the wiki formatting), you can try the jirawiki2docx Python library.

https://pypi.org/project/jirawiki2docx/

0 votes
Uzair Ansari September 16, 2022

@Woodson Miles I am using ConfluencePS PowerShell module to achieve this. It's quite simple to fetch and upload page contents using Get-ConfluencePage and Set-ConfluencePage commands.

You can get this module from PowerShell Gallery.
