Is there a Python library that can parse Confluence page content?

My goal is to use the Confluence API to get the content of a page, parse it, edit it, and update that same page with the edited content.
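For reference, the get/edit/update round trip described above could be sketched as follows. This assumes the Confluence REST `content` resource, where a GET on `/rest/api/content/{id}?expand=body.storage,version` returns the storage-format body and a PUT to `/rest/api/content/{id}` expects an incremented version number. The helper name `build_update_payload` and the sample page data are hypothetical; the HTTP calls themselves are left out.

```python
import json


def build_update_payload(page: dict, new_storage: str) -> dict:
    """Build the JSON body for a PUT to /rest/api/content/{id}.

    `page` is the parsed JSON from a GET on the same page with
    ?expand=body.storage,version (assumed response shape).
    """
    return {
        "id": page["id"],
        "type": page["type"],
        "title": page["title"],
        # The update is rejected unless the version number increases.
        "version": {"number": page["version"]["number"] + 1},
        "body": {
            "storage": {
                "value": new_storage,
                "representation": "storage",
            }
        },
    }


# Hypothetical GET response, trimmed to the fields used above.
page = {
    "id": "12345",
    "type": "page",
    "title": "Example page",
    "version": {"number": 7},
    "body": {"storage": {"value": "<p>old</p>", "representation": "storage"}},
}

# The "edit" step is a plain string replace here, just to show the flow.
edited = page["body"]["storage"]["value"].replace("old", "new")
payload = build_update_payload(page, edited)
print(json.dumps(payload, indent=2))
```

The parsing and editing step in the middle is the hard part, which is what the rest of this thread is about.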

At first, I assumed Confluence's storage format was HTML, so my original plan was to parse and edit the content with Python's BeautifulSoup module once I had retrieved it from the Confluence API. I now know that the Confluence page storage format is "XHTML-based". I've tried several BeautifulSoup parsers (lxml, xml, html.parser), but they all choke on the non-standard macro elements, like this:

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="283daa7d-46af-4d6d-a177-a00b4a2bc342"><ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body></ac:structured-macro>

 

Is there a preferred method for parsing this?

4 answers

I am having the same issue! If you found a solution could you share? Thanks!

I never found a parser that worked perfectly for Confluence. Ultimately, I was forced to edit Confluence pages using regular expressions.
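A minimal sketch of what such a regex edit might look like, using the macro from the question. The function name `replace_macro_body` is made up for illustration, and the pattern assumes the `ac:plain-text-body` element directly follows the macro's opening tag (no `ac:parameter` elements in between), so it is not a general solution.

```python
import re


def replace_macro_body(storage: str, macro_name: str, new_body: str) -> str:
    """Replace the CDATA body of every structured macro named `macro_name`."""
    if "]]>" in new_body:
        raise ValueError("new_body must not contain the CDATA terminator ']]>'")
    pattern = re.compile(
        # Group 1: opening macro tag with the wanted ac:name, up to the CDATA start.
        r'(<ac:structured-macro\b[^>]*\bac:name="' + re.escape(macro_name) + r'"[^>]*>'
        r'\s*<ac:plain-text-body><!\[CDATA\[)'
        # The old body, matched lazily so it stops at the first terminator.
        r'.*?'
        # Group 2: CDATA terminator and closing body tag.
        r'(\]\]></ac:plain-text-body>)',
        re.DOTALL,
    )
    # A function replacement avoids backslash processing in new_body.
    return pattern.sub(lambda m: m.group(1) + new_body + m.group(2), storage)


sample = (
    r'<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" '
    r'ac:macro-id="283daa7d-46af-4d6d-a177-a00b4a2bc342"><ac:plain-text-body>'
    r'<![CDATA[*\[CDRL:\]*]]></ac:plain-text-body></ac:structured-macro>'
)

result = replace_macro_body(sample, "unmigrated-wiki-markup", "h1. Title")
print(result)
```

Everything outside the CDATA section, including the `ac:macro-id` attribute, is passed through untouched.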

I have had the same issue. Ultimately, I resorted to using an HTML parser where that works, an XML parser where that works, and regex for everything else.
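One way to get an XML parser further, sketched here with the standard library only: the storage format is a fragment that uses the `ac:` and `ri:` prefixes without declaring them, and it can contain HTML entities such as `&nbsp;`. Wrapping the fragment in a root element that declares those prefixes, plus a DOCTYPE that defines the entities you expect, lets `xml.etree.ElementTree` parse it. The namespace URIs below are placeholders invented for this example, and `parse_storage` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (an assumption): any unique strings work for parsing.
NS = {"ac": "urn:example:confluence:ac", "ri": "urn:example:confluence:ri"}
AC = "{" + NS["ac"] + "}"  # ElementTree's "{uri}" prefix for ac: names


def parse_storage(fragment: str) -> ET.Element:
    """Wrap a storage-format fragment so that it parses as XML."""
    ns_decls = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
    wrapped = (
        # Declare the HTML entities the fragment may use; only nbsp here.
        '<!DOCTYPE root [<!ENTITY nbsp "&#160;">]>'
        f"<root {ns_decls}>{fragment}</root>"
    )
    return ET.fromstring(wrapped)


fragment = (
    "<p>Intro&nbsp;text</p>"
    '<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1">'
    r"<ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body>"
    "</ac:structured-macro>"
)

root = parse_storage(fragment)
for macro in root.iter(AC + "structured-macro"):
    name = macro.get(AC + "name")
    body = macro.find("ac:plain-text-body", NS)
    print(name, "->", body.text)
```

This covers reading. Writing back needs more care: ElementTree serializes CDATA sections as escaped text, and the wrapper element has to be stripped again before the content is sent to Confluence, so treat this as a starting point rather than a full round trip.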

John Larring
Jun 26, 2023

Same issue here. Need to find a way to parse Confluence XHTML incl. macro notation.

1 vote
Bill Bailey
Dec 03, 2021

Hey Bill, thanks for your response.

Alas, while these two libraries are helpful wrappers for the Confluence API, as far as I can tell, neither has the ability to parse the XHTML that the API returns as the content of a page.

0 votes
Ifeoluwa Akande
Jul 09, 2023 • edited

This might not be the solution you want; however, if you are interested in writing Confluence wiki text to docx format (while preserving the wiki formatting), you can try the jirawiki2docx Python library.

https://pypi.org/project/jirawiki2docx/

 

@Woodson Miles I am using the ConfluencePS PowerShell module to achieve this. It's quite simple to fetch and upload page contents using the Get-ConfluencePage and Set-ConfluencePage commands.

You can get this module from PowerShell Gallery.
