Is there a Python library that can parse Confluence page content?

My goal is to use the Confluence API to get the content of a page, parse it, edit it, and update that same page with the edited content.

At first, I assumed Confluence's storage format was HTML, so my original plan was to parse and edit the content with Python's BeautifulSoup module after retrieving it from the Confluence API. I now know that the Confluence page storage format is "XHTML-based". I've tried several BeautifulSoup parsers (lxml, xml, html.parser), but they all trip over the non-standard macro elements, such as this one:

<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1" ac:macro-id="283daa7d-46af-4d6d-a177-a00b4a2bc342"><ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body></ac:structured-macro>

Is there a preferred method for parsing this?

5 answers

2 votes

I am having the same issue! If you found a solution, could you share it? Thanks!

I never found a parser that worked perfectly for Confluence. Ultimately, I was forced to edit Confluence pages using regular expressions.
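
For anyone taking the regular-expression route, here is a minimal sketch of one such edit. It rewrites the CDATA body of every `ac:plain-text-body` element in a storage-format string and leaves the rest of the markup alone. The function name `replace_plain_text_bodies` and the pattern are illustrative assumptions, not part of any Confluence library, and the pattern assumes the body never contains a nested `]]>`.

```python
import re

# Matches <ac:plain-text-body><![CDATA[ ... ]]></ac:plain-text-body> and
# captures the opening wrapper, the body text, and the closing wrapper.
PLAIN_TEXT_BODY = re.compile(
    r"(<ac:plain-text-body><!\[CDATA\[)(.*?)(\]\]></ac:plain-text-body>)",
    re.DOTALL,
)

def replace_plain_text_bodies(storage: str, transform) -> str:
    """Apply transform() to every CDATA body; all other markup is untouched."""
    def _sub(match):
        return match.group(1) + transform(match.group(2)) + match.group(3)
    return PLAIN_TEXT_BODY.sub(_sub, storage)

# Sample fragment modelled on the macro quoted in the question.
sample = (
    '<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1">'
    r"<ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body>"
    "</ac:structured-macro>"
)
edited = replace_plain_text_bodies(sample, lambda body: body.replace("CDRL", "SOW"))
print(edited)
```

Restricting the pattern to one well-defined element keeps the edit predictable. A regex that tries to match nested macros in general is likely to be much more fragile.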

I have had the same issue. Ultimately, I have resorted to using an HTML parser (where that works), an XML parser (where that works), and regex for everything else.

Same issue here. I need to find a way to parse Confluence XHTML, including the macro notation.
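
One approach that may help for read-only parsing is sketched below. A strict XML parser rejects the fragment partly because the `ac:` and `ri:` prefixes are never declared, so the sketch wraps the fragment in a root element that declares them. The namespace URIs are placeholders I made up for illustration, and `parse_storage` and `find_macros` are hypothetical helper names, not part of any library.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs (assumption): the fragment uses ac: and ri: prefixes without
# declaring them, so the parser needs some declaration to bind them to.
NAMESPACES = {
    "ac": "http://example.com/confluence/ac",
    "ri": "http://example.com/confluence/ri",
}

def parse_storage(fragment: str) -> ET.Element:
    """Wrap a storage-format fragment in a root that declares the prefixes."""
    declarations = " ".join(f'xmlns:{p}="{uri}"' for p, uri in NAMESPACES.items())
    return ET.fromstring(f"<root {declarations}>{fragment}</root>")

def find_macros(root: ET.Element):
    """Yield (macro name, plain-text body or None) for each structured macro."""
    ac = NAMESPACES["ac"]
    for macro in root.iter(f"{{{ac}}}structured-macro"):
        body = macro.find(f"{{{ac}}}plain-text-body")
        yield macro.get(f"{{{ac}}}name"), (body.text if body is not None else None)

# Sample fragment modelled on the macro quoted in the question.
sample = (
    "<p>Intro</p>"
    '<ac:structured-macro ac:name="unmigrated-wiki-markup" ac:schema-version="1">'
    r"<ac:plain-text-body><![CDATA[*\[CDRL:\]*]]></ac:plain-text-body>"
    "</ac:structured-macro>"
)
macros = list(find_macros(parse_storage(sample)))
print(macros)
```

Two caveats apply. ElementTree does not keep CDATA sections when it serializes (it writes escaped text instead), and named HTML entities such as `&nbsp;` are undefined in plain XML, so they raise a `ParseError` unless they are handled first. For writing content back, lxml's `XMLParser` has a `strip_cdata` option that may be worth testing, though I have not verified a full round trip with it.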

I am using regex to modify Confluence pages.

2 votes

Have you seen these:

Hey Bill, thanks for your response.

Alas, while these two libraries are helpful wrappers for the Confluence API, as far as I can tell, neither can parse the XHTML that the API returns as the content of a page.

1 vote

Copilot suggested the following, which I can confirm worked:

import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

# Replace these variables with your Confluence details
confluence_url = 'https://<my site>.atlassian.net/wiki/rest/api/content/'
page_id = '<my page id>'  # Replace with your page ID
username = '<my username>'
api_token = '<my token>'

def get_confluence_page_text(page_id):
    url = f"{confluence_url}{page_id}?expand=body.view"
    auth = HTTPBasicAuth(username, api_token)

    response = requests.get(url, auth=auth)

    if response.status_code == 200:
        page_data = response.json()
        page_html = page_data['body']['view']['value']

        # Use BeautifulSoup to extract text from HTML
        soup = BeautifulSoup(page_html, 'html.parser')
        page_text = soup.get_text()

        return page_text
    else:
        raise Exception(f"Failed to fetch page: {response.status_code} - {response.text}")

if __name__ == "__main__":
    try:
        page_text = get_confluence_page_text(page_id)
        print(page_text)
    except Exception as e:
        print(e)
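
Note that the script above requests `body.view`, the rendered HTML, and reduces it to plain text, so its output cannot be written back to the page. For the fetch, edit, and update round trip described in the question, one would presumably request `expand=body.storage,version` instead and send the edited storage markup back with a PUT. The sketch below only builds the JSON body for that update. `build_update_payload` is a hypothetical helper name, and the payload shape is my understanding of the same `/rest/api/content/` endpoint, so please check it against the REST documentation for your Confluence version.

```python
import json

def build_update_payload(page: dict, new_storage: str) -> dict:
    """Build the JSON body for PUT {confluence_url}{page_id} from a fetched page.

    `page` is assumed to be the JSON returned by a GET with
    ?expand=body.storage,version on the same endpoint.
    """
    return {
        "id": page["id"],
        "type": page["type"],
        "title": page["title"],
        # Each update is expected to carry the next version number.
        "version": {"number": page["version"]["number"] + 1},
        "body": {
            "storage": {"value": new_storage, "representation": "storage"}
        },
    }

# Hand-written stand-in for a fetched page, for illustration only.
fetched = {
    "id": "12345",
    "type": "page",
    "title": "Example page",
    "version": {"number": 7},
    "body": {"storage": {"value": "<p>old</p>", "representation": "storage"}},
}
new_value = fetched["body"]["storage"]["value"].replace("old", "new")
payload = build_update_payload(fetched, new_value)
print(json.dumps(payload, indent=2))
```

The resulting dictionary could then be sent with something like `requests.put(url, json=payload, auth=auth)`, reusing the `auth` object from the script above.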

0 votes

This might not be the solution you are looking for. However, if you are interested in writing Confluence wiki text to docx format (while preserving the wiki formatting), you can try the jirawiki2docx Python library.

https://pypi.org/project/jirawiki2docx/

0 votes

@Woodson Miles I am using the ConfluencePS PowerShell module to achieve this. It's quite simple to fetch and upload page contents using the Get-ConfluencePage and Set-ConfluencePage commands.

You can get this module from PowerShell Gallery.
